Biggest Single Point of Failure

I see a good number of IT shops in my job, and in most cases the top priority is system uptime. I might be there to install, troubleshoot, etc., but at the front of my mind is the idea that everything must stay up and running.

IT departments are adding redundant WAN connections, server clusters, fault tolerance, failover devices and disaster recovery sites, building in redundancy at every level. But in some cases these departments are forgetting a pretty integral part of continuous uptime.

Meet the forgotten single point of failure.

Don’t forget about the System Engineers or Administrators. These guys and girls are there on almost a daily basis making sure that the redundancies are working correctly, running processes and monitoring the systems. Many companies can’t do without their “IT guy” for more than a day or two. This isn’t fair to the employee, and it can be a serious risk to a company’s infrastructure.

Think about it: how many people know how to fail over your production site to the disaster recovery site if a disaster happens? Do you want to leave such an important role in the hands of one person? This matters especially for DR, since there is a chance that in a disaster this person might not be able to work at all. People have families and homes that might take priority over work, depending on the disaster.

Let’s face it, I’m sure that these companies would like to have additional employees, but it comes down to cost. If this is the case, documentation is an absolute must. All procedures should be documented so that anyone can run the IT shop. Then again, if the engineer is spending that much time documenting at this level of detail, you might need another engineer in the first place!

Good Engineers can be expensive, but how much more expensive is it to lose data or business because of this single point of failure?