1 Part Corps of Engineers, 1 Part Secret Service

February 11, 2010

I’ve been thinking a lot lately about the role of System, Network and Database Administrators, and I’ve found it’s useful to think of us, collectively, as the “Operations” arm of an organization.  Whether I call it “Systems Operations”, “Network Operations” or some other variant doesn’t really help me understand our role any better, though the name could be useful to differentiate it from some other kind of operations department.

What I have found helpful is to think about our responsibilities and the functions we perform.

The model I have at the moment is that we are 1 part US Army Corps of Engineers, and 1 part US Secret Service.

We are like the Corps of Engineers because we build infrastructure.  We survey problem sites.  We design solutions for the environments that we support.  This requires us to plan ahead, so that we are not short on resources, and so that our work is done in coordination with other efforts and there aren’t costly delays.  We need to do requirements planning and peer reviews to ensure our plans have no oversights that will lead to defects, and the implementations need to be inspected for defects as things are built.

If defects are found, we have to determine how much they impact the project, and how much it will take to repair them.  If a defect will lead to a catastrophic failure, it is our duty to report it and try to raise awareness about the seriousness of the issue, so that the catastrophic failure will not occur.

When these steps are not followed, failures arrive like natural disasters: their exact timing cannot be predicted, but their eventuality can be predicted from a long-term understanding of the environment, its use, and its rate of decay.

This feels like a very useful analogy to me, and I think I can use it as a value system for making judgments.  “Should I do this?” and “What should I do?” are difficult questions on their own about any topic, but the analogy gives me a clear roadmap.  When I imagine what I would expect out of the Corps of Engineers, just for my own and others’ safety, I can see the similarities and come to decisions about how best to help my organization.

Operations also serves another purpose though, which I think of as similar to the Secret Service in many interesting ways.  The Secret Service has many different jobs, but primarily they consist of protecting the President, protecting the emergency military response gear, protecting areas of Executive National Security, and dealing with counterfeiting.

The Secret Service has the responsibility of last say in terms of security of the President and Executive interests.  They are supposed to veto any plan that puts the President in unnecessary danger, and while they may be overridden at times, the domain of maintaining the proper security belongs to them.

The Secret Service plans for failures, and drills constantly to ensure they are ready for unexpected events, so that they stay calm and respond according to a plan that has evolved from many previous experiences.

Finally, they have the responsibility to physically manhandle the President if an emergency situation breaks out.

I see a lot of parallels between Operations staff and the Secret Service.  We are responsible for ensuring that no one enters our systems without authorization.  We are responsible for ensuring that our data, customer data, and financial data are secure from being accessed by intruders.

We have to use certificates of authority, and secret handshakes with publicly and privately known information to allow our agents to communicate.

We are the ones who will first encounter problems with code on websites, or failures in hardware, and must plan and carry out an evacuation of any data that might be lost, or redirect traffic so that our customers aren’t sent to a broken service.

If attackers manage to enter our systems, we must isolate and quarantine them quickly, and thoroughly understand how they entered so we can stop it from being possible again, and find any similar vectors of attack.  We must then go through all our systems that can’t immediately be burned and resurrected, and ensure that they are not infected, or plan for a rapid migration to be able to clean the system and restore the safe perimeter of our organization.

Finally, I don’t think we need to tackle any of our management, but I think we do need to report our findings to them and impress upon them the accurate level of importance and urgency of each one, according to its impact on the business.  Preferably we do this with as much detail as we can about how it is likely to impact the business, and with any hard facts and data that back this up, to help them alter their plans based on how well the previous plans are working.

What I like about these models is that they allow me to more easily imagine what I should be doing, and how I should be acting, not in a cop-drama way, but as an illuminated path towards the responsibility my organization could use from me.