rebooting servers

Earlier this month, Paul Venezia of InfoWorld wrote a blog post, “Nine traits of the veteran Unix admin,” and this week’s follow-up, “When in doubt, reboot? Not Unix boxes,” has garnered almost 700 comments on Slashdot.

It was interesting reading, and there were the required remarks poking fun at rebooting Windows servers as a primary diagnostic tool. Having managed both Unix and Windows sysadmins over the last fifteen years, I can agree that the Unix admins are *generally* a bit more adept at troubleshooting. Because Unix was running on bigger, minicomputer-type boxes well before Windows Server was even a player, many admins rarely had a reboot option available – which forced them to be more resourceful. Linux has changed that mindset, as many of its administrators started out with Wintel.

His comment that “Rebooting is almost never an option” is a bit extreme, and I suspect he took that stance to make the point. All things being equal, you are indeed better off troubleshooting and conducting a post-mortem before rebooting and erasing evidence of the cause.

Obviously, if the reboot is a workaround that can restore service, that takes precedence. Additionally, as a manager trying to efficiently use and prioritize limited resources, I know that on a growing number of occasions, a rebuild is a better option than a repair, and I may not want to spend the time troubleshooting. That may disturb the purists and it may prevent the real problem from ever being resolved, but that may be the best option from a service management, business-oriented perspective.

There is also value in the periodic maintenance reboot. Unix guys may argue that Unix doesn’t require a “cleaning out” as some operating systems do. To me, that argument is moot. Just because a server can run uninterrupted for three years doesn’t mean it should. It’s just good practice to be prepared for the inevitable time when it has to be powered down or is forced down. System admins come and go, taking configuration knowledge with them. Daemons get started or stopped, and changes get made, without being documented. And hardware components die. I would much rather address these issues periodically with planned quarterly or semi-annual reboots than in an unplanned or chaotic moment.