Unique Organizational Glue

February 16, 2010

Glue is the most important part of any organization, and the part which is always unique to every organization.  Understanding this, and knowing how to make and apply glue, will make the difference between a smooth-running organization and an organization that is constantly firefighting and often working against itself.

In more traditional organizations, glue was process.  When you wanted to create cohesiveness between your employees and departments, you needed to create a process, train the players in performing the process, and then have a combination of rewards for following, and punishments for not following, the processes that glue your organization together.

Human processes will never be removed, because we are humans and will always have needs too subtle and changing, and human, to be automated.

That said, many things in today’s organizations can and should be automated, and this is the glue I spend most of my time making and thinking about.

The thing about organization glue is that it is always unique to your organization.  You can buy off-the-shelf glue, but you still have to apply it uniquely, and take care of it uniquely, and train people how to use it uniquely, because no other business does exactly what your business does, with the exact people and structure your business has.

So whether you are a Buy-Everything-Microsoft-Makes shop or a Build-Everything-Myself-From-Open-Source shop, you are still configuring everything uniquely, to solve your unique problems.

Herein lies the reason more expert operations people choose Linux and other *nix environments: in these environments you are expected to make your own glue.  All the components may come readily available, and many of the glues are already pre-mixed, but you are still expected to figure out where to put them, how to configure them, and probably to write your own custom glue code to connect piece X to piece Y, because they don’t quite line up.

In a pre-packaged environment, much of this has been done for you, many processes have already been worked out, and you are expected to implement them to specification.  Certificate programs are created to align workers with the commercial packages they support to enforce these “best practices”.

The trouble comes when these pre-made glue stamps fail to meet all your organization’s needs, and then you must create custom glue.  In an environment that expects or requires custom glue, this has a steep learning curve, but is expected and encouraged.  In an environment where everything is supposed to be planned for you to implement, it is very difficult to add your own processes, and agility is lost in exchange for the benefit of having the majority of your solution come out of a box.

Working around these unique elements, while still working with the system so as not to subvert the benefits you received from purchasing it, creates an extremely difficult situation, made worse by those who are not capable of, or do not believe in, custom solutions.

The real problem is one of expectations.  As a unique service provider, your business will do some things uniquely.  If you are in an industry where your service is all taken care of by humans, then your internal operations may well be simple enough to use off-the-shelf glue, and it will work well enough.

In the complex and ever-changing world of internet software companies, this is not the case, and never has been.  And yet, many people still do not understand that their organization requires custom glue, and that their processes will not simply connect together.  By leveraging the abilities of their senior staff to create custom solutions that mix in with existing open source and purchased solutions, they can find an optimal balance between buying and building, two options they never really had a choice between anyway.

You simply can’t buy it all, because no one sells “Your Business In a Box”.  It’s up to you to build your business, and if you do it well, it will seem like it fits in a box.  If you do it poorly, it will seem like a combination of post-Katrina wasteland and a forest fire.

Either way, it pays to understand that your business has unique goals, and that it will take unique glue to bind your employees and departments together to achieve those goals.


remlite on the path to R.E.M.

February 15, 2010

Messing around with names for my projects a bit.  I originally created Red Eye Monitor to manage cloud systems, then I expanded the design to manage multiple cloud vendors and private data centers in an integrated fashion.  Then I made a newer, lighter version called remlite, so that I could have something that takes much less time to update, but still has an understanding of different services and deployments (prod, staging, QA), so it would work in a larger environment than the original software, without taking as much forethought as Red Eye Monitor’s integrated multi-cloud and datacenter design took.

Now remlite has so many improvements over REM, specifically its very cool dynamic script running system, which integrates any scripts being run with monitoring, graphing and alerting, and its equally cool simplification of RRD graphing (which I have always found a pain to set up, but is now fast and easy).

Because remlite is running so smoothly, and has so many good non-organizational features, I’ve decided to make it the core of the Red Eye Monitor project.  Since remlite runs on YAML files and REM runs on a fairly massive MySQL table structure, I am going to split them up, so you will always be able to run REM off of YAML files, for a simple case.

The cloud infrastructure is going to be broken out so that wrapping Amazon EC2 API calls and creating your own in-house cloud, or connecting to other cloud vendors, will sit in its own project.  Temporarily I’m thinking about this as the Red Eye Monitor Cloud, which will abstract all things cloudy.  The requesting of instances, storage, load balanced or external IP addresses, etc., will all be wrapped into the Red Eye Cloud package, which will be stand-alone and contain a cache for read commands so failure to connect to APIs will still result in expected usage.
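To illustrate the read cache idea, here is a minimal sketch in Python.  It is not the Red Eye Cloud code; the class name `CachedReader` and the `fetch` callable are hypothetical stand-ins for whatever actually wraps the vendor API.  The point is only that a failed read falls back to the last good answer.

```python
import time


class CachedReader:
    """Wrap a read-only cloud call and fall back to the last good result.

    `fetch` is a hypothetical callable that talks to the vendor API and
    raises ConnectionError when the API cannot be reached.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}  # key -> (timestamp of last success, value)

    def read(self, key):
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._cache:
                # API unreachable: serve the last known good value.
                return self._cache[key][1]
            raise  # nothing cached yet, so the failure must surface
        self._cache[key] = (time.time(), value)
        return value
```

The timestamp is kept so a caller could decide how stale is too stale; this sketch never expires anything.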

remlite, which is all the management of interfacing with the cloud, and defining deployments and services to do so, will become Red Eye Monitor.  This is the core package, and will depend on Red Eye Monitor Cloud for interfacing with EC2 or any other cloud vendors (including an internal Home Cloud).

The old REM specification, with all its massive data modeling of your physical and virtual environment, will be known as Red Eye Control.  This will be able to be used as a stand-alone system that could act as the organizational brain for any operations center, and not interface with Red Eye Monitor.

Finally, all the HTTP and XMLRPC stuff I built into REM is being turned into a standalone or embedded web server called dropSTAR (Scripts Templates and RPC).  This is a Python based HTTP/S and XMLRPC server that can run multiple listening thread pools on different ports, and maps page paths to a series of scripts, much like the format of remlite scripts, to render data to a webpage or an RPC response.

So the final package list will look like this:

  • Red Eye Monitor (REM).  Currently “remlite”, this will be a YAML configured system for managing services and deployment in a cloud.
  • Red Eye Cloud.  Required by REM, wraps commands for any number of cloud environments, including your Home Cloud hardware wrappers.  Amazon EC2 is included.  This can be run as a library or stand-alone as well.
  • Red Eye Control.  This is a massive brain for your operational environments.  It will track every piece of hardware, down to their components, and all media connections between components to give you a complete understanding of your current infrastructure, and provide a comprehensive dependency graph for alert suppression.  In a standard REM configuration, Red Eye Control will be the source that is used to create all the YAML files that run Red Eye Monitor.  This will separate the brain data from the control scripts, but still allow one to drive the other.
  • dropSTAR.  HTTP and XMLRPC server.  Integrates into any Python program to provide a threaded web server with easy Web 2.0 functionality.  Much work has been done to work out a system to easily create dynamic pages and interact and update them from any long-running Python program.  Also works as a drop-in web server, which is much easier to throw onto a non-web system to provide insight into what is going on.
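The dependency graph for alert suppression mentioned under Red Eye Control can be sketched roughly as follows.  This is only an illustration of the idea, assuming a plain dictionary as the graph; the function name and the host names are made up, and none of this is taken from the real data model.

```python
def suppressed_alerts(depends_on, failing):
    """Return the failing nodes whose alerts can be suppressed.

    depends_on maps a node to the nodes it directly depends on.
    A failing node is suppressed when anything it depends on,
    directly or transitively, is also failing.
    """
    failing = set(failing)

    def has_failing_upstream(node, seen):
        for parent in depends_on.get(node, ()):
            if parent in seen:
                continue  # guard against cycles in the graph
            seen.add(parent)
            if parent in failing or has_failing_upstream(parent, seen):
                return True
        return False

    return {n for n in failing if has_failing_upstream(n, {n})}


# Hypothetical topology: web01 and db01 hang off switch-a, which hangs off router-1.
graph = {"web01": ["switch-a"], "db01": ["switch-a"], "switch-a": ["router-1"]}
print(sorted(suppressed_alerts(graph, ["web01", "db01", "switch-a"])))
# prints ['db01', 'web01']
```

Only switch-a would page anyone here, because it is the failing node with nothing failing above it.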

I may also break out the RRD automatic creation and updating, as this is such a hard thing to get right, and being able to dynamically throw any data into RRDs with minimal specification is very useful.  I have to figure out how to do this still, and I’ll probably start looking at it after these other projects are completed and launched.

The Importance of Reinventing the Wheel

February 15, 2010

“Don’t reinvent the wheel” seems like pretty common sense advice, but is it good advice?

What kind of wheel do you use for your car?  My wheels have aluminum alloy inner wheels, with spokes to reduce the amount of material that needs to be rotated to reduce rotational inertia.  The outer wheel, the tire, is vulcanized rubber wrapped around steel mesh to give additional structure to the rubber to improve its durability, and the vulcanization allows the rubber to remain firm and not break apart even as it heats up.

The treads in the tire are numerous, and were designed to allow water to flow in and around the tire as it rolls into contact with the ground and then rolls back out of contact again.  This keeps the maximum amount of friction with the road, so that my car can get traction and propel itself forward or backward.

When did this wheel stop being reinvented?  It appears to me it never has, and according to my material engineer friends, and proven in amateur practical attempts, tire technology is far behind in being able to keep cars firmly on the road at the speeds that motors and drive shafts are able to propel cars.  Wheels today are vastly different than the wooden spoke wheels of past centuries, or the rock wheels of past millennia.

Does the tire industry use the phrase “don’t reinvent the wheel”?  I doubt they do.  Yet, this phrase is used all the time in technology companies, even as software and hardware are known to be among the most volatile of technologies in terms of their pace of change.  Why is this?

One reason is that it’s a simplification of “don’t overly complicate things”.  If you are performing a simple job, use tools that exist and get it done so you can move onto the next job.  If software exists that properly does a job, use it and move on.

What about when software does not properly do a job?  What about if it does the job, but poorly, and requires significant maintenance, or causes any future changes to be looked at with fear of breaking the running system and thus to be avoided?

I believe this is the time when “don’t reinvent the wheel” is the least likely to be useful.  All progress requires reinvention of “wheels” all the time, or things would not be progressing.  The real question is “is this worth our time and money?”  This is a question that is always appropriate and can always supersede a “let’s not reinvent the wheel” simplification.

When is it time to reinvent the wheel?

  1. When you want your wheels to move faster.
  2. When you want your wheels to last longer.
  3. When you want your wheels to provide better traction, especially moving fast and taking corners.
  4. When you have the capability to make a better wheel, and the time and money to do so.
  5. When the cost-benefit ratio is worth it.

Applying this thinking to your business provides a useful metaphor for when it is time to reinvent a wheel, and when it is time to use the wheels that already exist.  The metaphor has many built-in direct comparisons.  Speed can be exchanged for volume or turn-around time.  Lasting longer maps well to maintenance cycles.  Being able to take turns at speed maps to being able to make new business goals and have your organization and software change to meet the new goals.

When I go to buy actual wheels for my car, I don’t develop my own tires, grow my own rubber trees, mine or produce my own metals.  I am not skilled enough at any of these things to make an improvement on the wheels that are sold by existing commercial organizations.  I also could not do it for anywhere near the costs of buying a new wheel and tire.  I would have to buy ore, create a factory or set one up at home (probably illegally), and it may take me years to create a usable wheel and tire combination to use, and they would almost certainly be of far lower quality than the worst tires and wheels I could purchase.  This is clearly a poor option, and purchasing a wheel provides many benefits.

In technology, things work similarly, the most similar being hardware, which has very similar creation processes.  You can now outsource your fabrication, but physical electronic development is extremely difficult and error prone, and even getting working hardware out of your outsourced fabrication plants and into your customers’ hands can be so tricky that businesses with working hardware specifications can go out of business while trying to get their devices manufactured and in their hands to sell.

Software has the enormous benefit of being totally virtual.  Software merely has to attach properly to the environment it runs in (OS, drivers, libraries), fit in its physical resource constraints (storage, network and memory), and be internally consistent to provide a desired functionality.

Software provides one of the most obvious places to reinvent the wheel, because software is a series of commands to do what you want, and what you want is often different under different circumstances.  The same software wheel cannot provide you all the different results you want without being reinvented to update its internal logic and data to your desires.

Many pieces of software, say the Apache HTTP server, are so generic and customizable that they become ubiquitous in internet based software environments.  The original purpose of Apache was to deliver static content, in the form of HTML formatted text and image files, and later to allow running executable programs whose results would be returned instead of the static content.

Over the years, our desires for what software will give us have changed dramatically, and Apache has changed dramatically too, but still does essentially the same job.  Apache was once at the heart of what a web server was, and now it is merely a window that functions to keep requesters on one side, and the producers on the other side, while being mostly transparent, just routing information through from requester to producer and back again, with some access and redirection rules.

Some organizations have done away with Apache, or only use it to deliver static text and images quickly, and then all other requests are sent elsewhere.  The wheel of web request serving has been rewritten, but has it been rewritten for the last time?

That is unlikely.  All that is needed before the next time you find yourself reinventing the wheel is a goal that can’t be met in a satisfactory manner with current technology.

Time and money permitting.  🙂

1 Part Corps of Engineers, 1 Part Secret Service

February 11, 2010

I’ve been thinking a lot lately about the role of System, Network and Database Administrators, and I’ve found it’s useful to think of us, collectively, as the “Operations” arm of an organization.  Whether I think of it as “Systems Operations”, “Network Operations” or some other variant doesn’t really help me understand our role any better, but could be useful to differentiate it from some other kind of operations department.

What I have found helpful is to think about our responsibilities and the functions we perform.

The model I have at the moment is that we are 1 part US Army Corps of Engineers, and 1 part US Secret Service.

We are like the Corps of Engineers because we build infrastructure.  We survey problem sites.  We design solutions for the environments that we support.  This requires us to plan ahead, so that we are not short on resources, and so that our work is done in coordination with other efforts, so there aren’t costly delays.  We need to do requirements planning, and do peer reviews to ensure we do not have oversights in our plans that will lead to defects, and the implementations need to be inspected for defects as things are built.

If defects are found, we have to determine how much it impacts the project, and how much it will take to repair it.  If it will lead to a catastrophic failure, it is our duty to report it and try to raise awareness about the seriousness of the issue, so that the catastrophic failure will not occur.

When these steps are not followed, disasters occur, much like natural disasters, whose exact timing could not be predicted, but whose eventuality can be predicted from a long-term understanding of the environment and its use and rate of decay.

This feels like a very useful analogy to me, and I think I can use it as a value system for making judgments.  “Should I do this?” and “What should I do?” are difficult questions on their own about any topic, but when I imagine what I would expect out of the Corps of Engineers, just for my own and others’ safety, I have a clear roadmap: I can see the similarities and come to decisions for how best to help my organization.

Operations also serve another purpose though, which I think of as similar to the Secret Service in many interesting ways.  The Secret Service has many different jobs, but primarily they consist of protecting the President, protecting the emergency military response gear, protecting areas of Executive National Security, and they also deal with counterfeiting.

The Secret Service has the responsibility of last say in terms of security of the President and Executive interests.  They are supposed to veto any plan that puts the President in unnecessary danger, and while they may be overridden at times, the domain of maintaining the proper security belongs to them.

The Secret Service plans for failures, and drills constantly to ensure they are ready for unexpected events and stay calm and respond according to a plan that has evolved from many previous experiences.

Finally, they have the responsibility to physically manhandle the President if an emergency situation breaks out.

I see a lot of parallels between Operations staff and the Secret Service.  We are responsible for ensuring that no one enters our systems without authorization.  We are responsible for ensuring that our data, customer data, and financial data are secure from being accessed by intruders.

We have to use certificates of authority, and secret handshakes with publicly and privately known information to allow our agents to communicate.

We are the ones who will first encounter problems with code on websites, or failures in hardware, and must plan and carry out an evacuation of any data that might be lost, or redirect traffic so that our customers aren’t sent to a broken service.

If attackers manage to enter our systems, we must isolate and quarantine them quickly, and thoroughly understand how they entered so we can stop it from being possible again, and find any similar vectors of attack.  We must then go through all our systems that can’t immediately be burned and resurrected, and ensure that they are not infected, or plan for a rapid migration to be able to clean the system and restore the safe perimeter of our organization.

Finally, I don’t think we need to tackle any of our management, but I think we do need to report our findings to them and impress upon them the accurate level of importance and urgency of each one, according to its impact on the business, preferably with as much detail as possible about how it is likely to impact the business, and with any hard facts and data that will back this up, to help them make alterations in their plans based on how well the previous plans are working.

What I like about these models is that they allow me to more easily imagine what I should be doing, and how I should be acting, not in a cop-drama way, but as an illuminated path towards the responsibility my organization could use from me.

The relationship between defects and volume

February 10, 2010

Imagine listening to an old musical recording on a set of speakers.  At low volume, pops and cracks are noticeable because old recordings were done on records which had physical scratches that created the extra noise.

If the volume is turned up, so that the old recording is played louder, the pops and cracks will become more noticeable, and may start to take up more of your attention, and as the volume becomes very loud, the reduced experience from listening to the music with the pops and cracks may cause you to stop listening altogether, because the defects have made it unenjoyable.

Now imagine listening to a very clear new musical recording on the same set of speakers.  At low volume, the sound is clear and enjoyable.  At a reasonably high volume the sound is clear and enjoyable.  As you approach the maximum volume, pops and cracks start to be audible, as the defects in the speakers themselves are now being displayed.

Volume causes defects to become noticeable and important.

This is why a design for a web application that is quickly thrown together may work for an initial group of users, but once the site becomes popular, the application could crash or fail to keep up with the load.  The operation of the application may require many more servers to be purchased, or worse, not scale onto more servers and require purchasing an ever larger and more powerful single machine.

Horizontal growth can be purchased at slightly more than linear pricing, with a higher staffing or automation requirement.  Vertical growth grows exponentially more expensive, until it ceases to be possible to grow more vertically, and some horizontal scaling must be added (typically by horizontally scaling slowly with similarly expensive vertical solutions).

The volume of use anything has directly corresponds to the likelihood that defects will cause a failure.
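One rough way to put numbers on this, assuming each use independently trips a given defect with some small probability p (a simplification made here for illustration, since real defects are rarely independent), is that the chance of at least one failure over n uses is 1 - (1 - p)^n:

```python
def failure_probability(p, uses):
    """Chance that at least one of `uses` independent uses hits the defect.

    p is the assumed per-use probability of the defect causing a failure.
    """
    return 1.0 - (1.0 - p) ** uses


# A hypothetical one-in-100,000 defect at three different volumes.
for uses in (100, 10_000, 1_000_000):
    print(uses, round(failure_probability(1e-5, uses), 4))
```

At a hundred uses the defect is almost invisible; at a million uses it is close to certain to bite.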

Once this is understood, and used to assist in decision making, then a solid plan can be put together for how long a defect can exist on the path to increasing volume.

Solutions Design: Fast, Cheap, Good. Choose 3.

February 9, 2010

In the past, the truism was: Fast, cheap, good.  Choose 2.

Whether talking about software design, or systems operations design, we are now living in the future.  Why has this changed?  Because the economies of scale for completing software and operations projects have changed.

The kernel of truth to this still exists: each of these elements, speed, cost and quality, can press upon the others to unbalance any solution.  If a project is done fast, it can be done sloppily, without dotting every “i” and crossing every “t”.  To create a solution faster, you could hire more people, which makes the solution fail at cheapness.  If you want quality, you could spend a long time building it, and pay for the best in the business.

There is a new way of looking at all of this, however, because like technology, other things have progressed with the times as well.  The relationship between fast, good and cheap has not changed, but changes of scale have made it possible to work faster, better and cheaper, so when best-of-breed hybrid techniques are applied, none of the three areas needs to be disproportionate.

Our ability to understand project management has drastically improved over the past 30 years of mainstream computing solution design and development.  Our ability to understand how to use technology, with methodologies like Object Orientation, 3-Tier server systems, Agile development methodologies, to name only three well-known improvements, gives us a better way to approach solution creation.  There are thousands of improvements out there, in the public, available for anyone to learn and begin using.

Using this knowledge and these skills leads to the ability to work faster, and create higher quality work.  Additionally, with the broad spectrum of commercial and open source solutions to leverage, there are many applications, services and libraries which turn many solutions into mostly integration work, and the majority of the original thinking is in managing the host of existing solutions.

“Fast, cheap, good: choose 2” will remain a fun joke for those under pressure to complete projects, but with all the information out there at your fingertips, all the work already completed and made available for you to use by others, and having personally seen proof to the contrary, I think it is time to retire this saying as a truism.

The bar has been raised; if you’re only getting 2 these days, you’re doing it wrong.

Emergency is just another word for incompetency

February 9, 2010

In the system administration world, unexpected events need to be expected.

Your hard drives failed?  The Mean Time Between Failures (MTBF) gave you a statistical prediction this was going to happen; you should have planned for it.
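As a back-of-the-envelope sketch of that prediction, assuming a constant failure rate and a vendor-quoted MTBF (both simplifications, and the fleet size below is made up), the expected number of failures per year is just drive-hours divided by MTBF:

```python
HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(drive_count, mtbf_hours):
    """Expected drive failures per year, assuming a constant failure rate."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


# Hypothetical fleet: 500 drives rated at 1,000,000 hours MTBF.
print(expected_failures_per_year(500, 1_000_000))
```

For this made-up fleet that works out to a little over four dead drives a year, which is a number you can budget spares and staff time against.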

A network partition occurred?  All networks have the ability for multiple paths, and the cost-benefit algorithm allows adjustment for how much redundancy can be created to plan for all levels of internal and vendor failures.  The CAP (Consistency, Availability, Partition tolerance) principle assists us further, telling us that in any networked environment, we either have to focus on consistency, with periods of unavailability, or on availability, with eventual consistency.  In either case, if properly planned and implemented, a solution can provide an acceptable result and not create an emergency.

If fire drill work is erupting frequently enough that your team cannot replace legacy infrastructure fast enough, and is primarily tasked with responding to fires, then there is a systemic problem in place.

The solution to all of these problems is organization, planning, applying expertise to problem domains, having designs and work reviewed by qualified peers, and the all-encompassing requirement to care about the quality of work performed, levels of robustness, and comprehensive failure plans, both through automation and human processes.

When you are experiencing frequent emergencies, it is time to look inward towards your processes and approaches to your work, because some things are being done poorly enough to cause these emergencies, and failure to reflect and change will not lead to anything but more emergencies.