First Time Screwing up Remote Servers
edit: 2016-09-22
Updated with the joke from nixcraft
I guess there is always a first time for everything. When I first heard
jokes like rm -rf /
and laughed at how ridiculous they were, little did
I know that mistakes of this kind, though not always as severe, can happen
to almost anyone. So instead of the usual long technical blog post, I
will briefly document how I screwed up our servers at the Linux User Group
remotely. Though I did not cause anything as permanent as wiping
out our disks, my mistake caused hours of downtime for our servers.
I sshed into the remote server as root (yes, you know how the story
goes) to configure our ssh service. This physical server hosts 2
virtual machines, which are critical to keeping our website running.
The networking setup on this server is where things get interesting.
Since both VMs have their own public IPs, the physical server has to
sort out traffic destined for itself and for the 2 virtual servers. The 2
virtual servers run on the Xen hypervisor, which is where the xenbr0
network interface comes in.
A quick run of the ifconfig
command showed the xenbr0
interface, which
I assumed handled packets forwarded to the VMs, but I could not find
the usual eth0
interface on the physical host. So I thought:
how about temporarily bringing the interface down, since it's easy to
bring it back up anyway? (Stupid, stupid idea.)
xenbr0    Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX
          inet addr:XXX.XXX.XXX.XXX  Bcast:XXX.XXX.XXX.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
ifconfig output for the xenbr0 interface, some info omitted
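Had I poked at the bridge a little more before acting, the setup would have been obvious: in a typical Xen bridged configuration, the physical NIC is enslaved to the bridge and the host's own IP address sits on the bridge itself. A rough sketch of the checks I could have run, assuming the usual bridge-utils and iproute2 tools were on the box (interface names here are only illustrative):

# List the interfaces enslaved to the bridge; on a Xen host the
# physical NIC (e.g. eth0 or peth0) typically shows up here.
brctl show xenbr0

# Ask the kernel which interface it would use to reach the outside
# world; if the answer is xenbr0, the host's own traffic rides on it too.
ip route get 8.8.8.8

# Confirm which interface actually carries the host's public IP.
ip addr show xenbr0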
So I ran ifconfig xenbr0 down
on the physical server and
waited, but the command was unresponsive for a long time. Then I
realized something was going wrong. Looking back at the xenbr0
interface,
I saw that its inet addr was the same as the physical server's own
address, which meant that xenbr0
handled all incoming traffic for both
the physical server and the 2 VMs. Now I was in big trouble -- I
had just brought down the network interface of the physical server,
along with my own ssh session. No
wonder the command never responded. All 3 servers, the physical host
and the 2 VMs, were now unreachable. I would have to ask my friend/colleague
to re-enable the interface at the console.
So the story ends. Any takeaway messages? I will quote my friend on this:
Protip: if you're gonna ssh into a machine and turn off network interfaces, you're gonna have a bad time.
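For what it's worth, a common way to hedge against exactly this kind of mistake is to schedule an automatic rollback before touching the interface, so the machine recovers on its own even if you cut off your own session. A minimal sketch, not what we actually did, assuming atd is running and sticking with ifconfig as above:

# Schedule the interface to come back up in 5 minutes no matter what,
# then make the change; cancel the job with atrm if all goes well.
echo "ifconfig xenbr0 up" | at now + 5 minutes
ifconfig xenbr0 down

# Alternatively, keep it all in one detached command so that losing
# the ssh session does not kill the rollback:
nohup sh -c 'ifconfig xenbr0 down; sleep 300; ifconfig xenbr0 up' &

Neither of these is a substitute for console access, but either one would likely have turned hours of downtime into a few minutes.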
The other day, I came across this joke from nixcraft, which perfectly illustrates the mistake I made here, though in terms of sshd.
I’ve never locked my keys in my car, but I did once kill sshd and that’s basically the same thing.