“CAP Theory” should have been “PAC Theory”

October 8, 2010

CAP obviously sounds a lot better, as it maps to a real word; that probably got it remembered.

However, I suspect the name has also helped keep the concept from being understood.  The problem is that the “P” comes last.

CAP: Consistency, Availability, Partitions.  Consistency == Good.  Availability == Good.  Partitions == Bad.

So we know we want C and A, but we don’t want P.  When we talk about CAP, we want to talk about how we want C and A, and let’s try to get around the P.

Except that the entire principle behind the “CAP Theory” is that Partitions are a real event that can’t be avoided.  Have 100 nodes?  Some will fail, and you will have partitions between them.  Have cables or other media between nodes?  Some of those will fail, and nodes will have partitions between them.

Partitions can’t be avoided.  Have a single node?  It will fail, and you will have a partition between you and the resources it provides.

Perhaps had CAP been called PAC, then Partitions would have been front and center:

Due to Partitions, you must choose to optimize for Consistency or Availability.

The critical thing to understand is that this is not an abstract theory, this is set theory applied to reality.  If you have nodes that can become parted (going down, losing connectivity), and this can not be avoided in reality, then you have to choose between whether the remaining nodes operate in a “Maximize for Consistency” or “Maximize for Availability” mode.

If you choose to Maximize for Consistency, you may need to fail to respond, causing non-Availability in the service, because you cannot guarantee Consistency if you respond in a system with partitions, where not all the data is still guaranteed to be accurate.  Why can it not be guaranteed to be accurate?  Because there is a partition, and it cannot be known what is on the other side of the partition.  In this case, not being able to guarantee accuracy of the reported data means it will not be Consistent, and so the appropriate response to queries is to fail, so they do not receive inconsistent data.  You have traded availability, as you are now down, for consistency.

If you choose to Maximize for Availability, you can assemble a quorum of data, or make a best guess as to the best data, and then return it.  Even with a network partition, requests can still be served with the best possible data.  But is it always the most accurate data?  No, and this cannot be known, because there is a partition in the system and not all versions of the data are known.  Quorums of nodes exist to try to deal with this, but with the complex ways partitions can occur, they cannot be guaranteed to be accurate.  Perhaps they can be “accurate enough”, and that means, again, that Consistency has been given up for Availability.
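
To make the choice concrete, here is a minimal sketch, with purely illustrative names (Replica, read_cp, read_ap) and no real database API, of how a read might behave in each mode when a partition is noticed:

```python
# Minimal sketch of the C-vs-A choice during a partition.
# All names here are illustrative, not any real system's API.

class PartitionError(Exception):
    pass

class Replica:
    def __init__(self, data, reachable=True):
        self.data = data          # {key: (version, value)}
        self.reachable = reachable

def read_cp(key, replicas, quorum):
    """Maximize for Consistency: refuse to answer without a quorum."""
    up = [r for r in replicas if r.reachable]
    if len(up) < quorum:
        raise PartitionError("no quorum; refusing to serve possibly stale data")
    return max(r.data[key] for r in up if key in r.data)[1]

def read_ap(key, replicas):
    """Maximize for Availability: answer with the newest value we can reach."""
    up = [r for r in replicas if r.reachable]
    candidates = [r.data[key] for r in up if key in r.data]
    return max(candidates)[1] if candidates else None

# One replica holds the newest write but is cut off by a partition.
replicas = [Replica({"x": (1, "old")}),
            Replica({"x": (2, "new")}, reachable=False)]
print(read_ap("x", replicas))          # "old" -- available, but stale
# read_cp("x", replicas, quorum=2)     # raises PartitionError -- consistent, but down
```

The CP read refuses to answer without a quorum; the AP read answers with whatever it can reach, even though a newer value may be sitting on the far side of the partition.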

Often, giving up Consistency for Availability is a good choice.  For things like message forums, community sites, games, or other systems that deal with non-scarce resources, this is a problem that benefits from relaxing the requirement for Consistency, because it’s more important that people can use the service, and the data will “catch up” at some point and look pretty consistent.

If you are dealing with scarce resources like money, airplane seat reservations (!), or who will win an election, then Consistency is more important.  There are scarce resources being reserved by the request; to be inconsistent in approving requests means the scarce resources will be over-committed, and there will be penalties external to the system to deal with.

The reality of working with systems has always had this give and take to it.  It is the nature of things not to be all things to all people; they only are what they are.  The CAP theory is just an explanation that you can’t have everything, and since you can’t, here is a clear definition of the choices you have:  Consistency or Availability.

You don’t get to choose not to have Partitions, and that is why the P:A/C theory matters.


Local System Monitoring Demo

August 22, 2010

I have the first draft of the local system monitoring demo (single node) ready: It can be viewed here.

I’ll be fleshing this out more after I finish the monitoring for Linux, fix the Disk I/O to update properly in FreeBSD and OS X, and fix the View Internals for the RRDs, on RRDs that have multiple targets per type.  Then I’ll add some formatting for the sections, make the list of items dynamic so you can turn uninteresting ones off, and then I will ship that demo.  After a few more demos to finish testing out all the different packages that it takes to make up Red Eye Monitor (REM), I will turn this into a real monitoring software install that does good things out of the box, and works on single nodes or multiple nodes.


dropSTAR (webserver) procblock demo

August 18, 2010

Now for a demo with a bit more teeth.  This will soon be released as a standalone open source web server package, named dropSTAR (Scripts, Templates and RPC).  It is designed to make it easy to drop in new dynamic pages, and is focused on system automation instead of the standard end-user applications that normal dynamic web servers are intended to serve.  It can do that kind of work too, but it’s not optimized for millions of page loads; it’s optimized to take as little time as possible to make and modify web pages for system administration use, and to provide RPC for dynamic pages or system automation between nodes or processes.

The demo can be downloaded at the procblock downloads page.  You will need to install procblock as well, which is also available at that link.

A live version of the demo can be played with here.

The demo is a series of tabs, each doing something different:

CPU Monitoring

This tab shows a very simple CPU statistics collection program (it runs a shell command, parses the output, and returns a dict of fields), which runs in an Interval RunThread every 5 seconds, and then graphs the results.  The page automatically reloads the graph every 5 seconds so it can be watched interactively.
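
As a rough illustration of that kind of collector (this is not the demo’s actual code; the command, field names, and interval handling are assumptions), something like this gathers a dict of CPU fields every 5 seconds:

```python
import subprocess
import threading
import time

def collect_cpu_stats():
    """Run a shell command and parse its output into a dict of fields.

    The first line of /proc/stat is used as an example source; the real
    demo may use a different command and parser.
    """
    output = subprocess.check_output(["cat", "/proc/stat"], text=True)
    fields = output.splitlines()[0].split()   # ['cpu', user, nice, system, idle, ...]
    names = ["user", "nice", "system", "idle", "iowait"]
    return dict(zip(names, (int(v) for v in fields[1:1 + len(names)])))

def interval_collector(interval=5.0):
    """Collect every `interval` seconds, like the demo's interval run-thread."""
    while True:
        print(collect_cpu_stats())            # the demo graphs these instead
        time.sleep(interval)

if __name__ == "__main__":
    threading.Thread(target=interval_collector, daemon=True).start()
    time.sleep(12)                            # let a couple of samples print
```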

System Information

This is another very simple shell command that cats /proc/cpuinfo and puts the columns into an HTML table.
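
A sketch of that idea, assuming Linux’s /proc/cpuinfo as the source (again, not the demo’s actual code):

```python
import html

def cpuinfo_to_html_table(path="/proc/cpuinfo"):
    """Turn the key/value lines of /proc/cpuinfo into a two-column HTML table."""
    rows = []
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue                      # skip blank separators between CPUs
            key, _, value = line.partition(":")
            rows.append("<tr><td>%s</td><td>%s</td></tr>"
                        % (html.escape(key.strip()), html.escape(value.strip())))
    return "<table>\n%s\n</table>" % "\n".join(rows)

print(cpuinfo_to_html_table())
```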

Logs

This tab reads the tail of a log file, reverses the lines, and splits the contents into colored HTML.  It updates every 5 seconds.
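
A minimal sketch of that kind of log tab, with a hypothetical log path and keyword-to-color mapping:

```python
import html

def tail_log_as_html(path, lines=50):
    """Return the last `lines` lines of a log, newest first, colored for HTML."""
    with open(path, errors="replace") as f:
        tail = f.readlines()[-lines:]
    tail.reverse()                            # newest entries first
    out = []
    for line in tail:
        color = "red" if "ERROR" in line else "orange" if "WARN" in line else "black"
        out.append('<div style="color:%s">%s</div>'
                   % (color, html.escape(line.rstrip())))
    return "\n".join(out)

print(tail_log_as_html("/var/log/syslog"))    # hypothetical log path
```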

Requests

This is the most complex tab on the page.  It has a monitor for “requests”, which is a counter in the shared resources (unidist.sharedcounter module), and a thread runs with a delay and increments the request counter.  The total number of requests since starting is shown as text, and the graph displays the change in this request variable over time.

A slider allows adjustment of the delay between requests, which is saved in shared state (unidist.sharestate module).  Reloading the page will keep the slider in any changed position, and the graph/counter should correlate with the position of the slider in terms of requests per second.

There is also a button called “Stop Request Poller”, which acquires a lock (unidist.sharedlock module) and stops the poller from incrementing the request counter.  If toggled again, requests will resume.
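
For illustration only, here is the counter/poller/lock idea using nothing but the Python standard library; the real demo uses unidist’s shared counter and shared lock, whose API is not shown here:

```python
import threading
import time

request_count = 0
counter_guard = threading.Lock()       # protects request_count
poller_enabled = threading.Event()     # cleared == "Stop Request Poller" pressed
poller_enabled.set()

def request_poller(delay):
    global request_count
    while True:
        if poller_enabled.is_set():            # only count while not locked out
            with counter_guard:
                request_count += 1
        time.sleep(delay)

threading.Thread(target=request_poller, args=(0.5,), daemon=True).start()
time.sleep(3)
print("requests so far:", request_count)       # roughly 6 at a 0.5 second delay
poller_enabled.clear()                         # the "Stop Request Poller" effect
```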

The bottom right side of the page has not been completed yet, and is just there to look pretty and take up page space.  Later this will turn into an adjustable SLA monitor which will notify or alert (via the HTML page) when the SLA is near or out of tolerance with regard to requests per second.

Wall

This page shows the use of the message queues (unidist.messagequeue module), which allow messages to be inserted into a queue for later processing.  Any message typed into the input field with an enter key or Write button click will be inserted into the message queue “wall” in the shared message queues.  Any messages older than the last 25 are discarded to keep it only storing useful data.  Messages are not removed from the queue on reading, so that they can be continually re-processed for display.

Then, within 5 seconds, an RPC call updates the page with all the messages.
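
The wall behavior can be sketched generically with the standard library (the real demo uses unidist.messagequeue, whose API is not shown here):

```python
from collections import deque
from threading import Lock

class WallQueue:
    """Keeps only the newest 25 messages; reading does not remove them."""

    def __init__(self, max_messages=25):
        self._messages = deque(maxlen=max_messages)   # old entries drop off automatically
        self._lock = Lock()

    def write(self, message):
        with self._lock:
            self._messages.append(message)

    def read_all(self):
        with self._lock:
            return list(self._messages)   # non-destructive read, for re-display

wall = WallQueue()
for i in range(30):
    wall.write("message %d" % i)
print(len(wall.read_all()))    # 25 -- messages 0-4 were discarded
```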


Launching in progress…

August 17, 2010

I’ve started launching Red Eye Mon, but it’s going to be a long road. Currently I have packaged up two of the core components:

  • procblock: a logic and data tag processor and hardened execution environment
  • unidist: Unified Distributed Computing for Python: Message Queues, Locks, Counters, State, Logs and Time Series

unidist is the base library, providing a bunch of Distributed Computing mechanisms, like locks, message queues, counters, state and logging.  This allows programs to share information easily, in a number of different ways that make using the shared data easy and reliable.

procblock is the workhorse of the system.  It is a method to abstract the Architecture of a system of scripts, so that the scripts themselves remain simple, yet can be combined together to do complete things.

procblock is a hardened execution environment, which handles thread and process management in a variety of ways.  procblock is also a conditional tag processor, which can conditionally return data.

Mixing the tag processor functionality with the hardened execution environment functionality delivers some interesting results.


A demo of a web server (dropSTAR) running via procblock, using the unidist tools, and doing graphing, using message queues, locks and counters, with code and data exposed, can be found here.

Next, I will post documentation of the unidist library, add unit tests, and start to release demos that can be downloaded with examples and instructions on how to use.

One of these demos will be the dropSTAR (scripts, templates and RPC) webserver, which runs via procblock, and is configured with procblock, and whose pages are rendered via procblock.  dropSTAR is another important component to the Red Eye Mon (REM) system, as it provides insight into and communication between nodes.

After procblock, unidist and dropSTAR releases are complete, I will release schemagen and the Mother Brain schema for Red Eye Mon, which will be the control database for managing automated systems, and then I will start to release all the ported sets of scripts for system management, deployment, monitoring and cloud management.


Continuous Testing

August 4, 2010

Today’s system and network monitoring primarily consists of collecting counters and usage percentages, then alerting if they are too high or low.  More comprehensive analysis can be performed by using the same troubleshooting logic a human would apply on encountering a system condition, such as a resource being out of limits (ex. CPU at 90% utilization), and encapsulating this logic along with the alert condition specification, so that automation can be created following the same procedures that would ideally be done by the human on duty.
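
As a small sketch of what encapsulating the condition together with the human’s first troubleshooting steps might look like (the names and thresholds here are hypothetical, and the ps flags assume Linux):

```python
import subprocess

def cpu_over_limit(sample, limit=90.0):
    """The alert condition: CPU utilization above the limit."""
    return sample["cpu_percent"] > limit

def diagnose_high_cpu():
    """The first steps an on-duty human would take, captured as code."""
    top = subprocess.check_output(
        ["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"], text=True
    ).splitlines()[:6]                       # header plus the five busiest processes
    return {"top_processes": top}

def check(sample):
    """Pair the alert condition with its troubleshooting procedure."""
    if cpu_over_limit(sample):
        return {"alert": "cpu_high", "sample": sample, "findings": diagnose_high_cpu()}
    return None

print(check({"cpu_percent": 93.5}))
```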

By constantly aggregating data across services into new time series, the aggregated data can be analyzed and re-processed in the same way the originally collected data was, to create more in-depth alerting or reaction conditions, or to create even higher levels of insight into the operations of the system.

The key is to create the process exactly as the best human detective would perform it.  The processes need to map directly to your business and organizational goals, and this is easiest to keep consistent over many updates if the processes are modeled after an idealized set of values for solving those goals.

For example, a web server system could be running for a media website, which makes its money on advertising. They have access to their advertising revenue and ad hit rates through their provider, and can track the amount of money they are making (a counter data type), up to the last 5 minutes. They want to keep their servers running fast, but their profit margins are not high, so they need to keep their costs minimal. (I’m avoiding an enterprise scale example in order to keep the background situation concise.)

To create a continuous testing system to meet this organization’s needs, a series of monitors can be set up to collect all relevant data points, such as using an API to collect from the advertising vendor, or scraping their web page if an API isn’t available. Hourly cost (a counter data type) and the count of running machine instances (a gauge data type) can be tracked to provide insight into current costs, to compare against advertising revenues.

In addition to financial information, system information, such as the number of milliseconds it takes their web servers to deliver a request (a gauge data type), can be stored. They have historical data that says they get more readers requesting pages and make more money when their servers respond to requests faster, so having the lowest response times on the servers is a primary goal.

However, more servers cost more, and advertising rates fluctuate constantly. At times ads sell for more, at times for less, and sometimes only default ads are shown, which pay nothing. During these periods the company can lose money by keeping enough machine instances running to hold web server response times under 200ms at all times.

Doing another level of analysis on the time series data for incoming ad revenue, the costs of the currently running instances, and the current web server response times, an algorithm can be developed to maximize revenue during both periods of high advertising revenue and periods of low advertising revenue. This could relax the threshold for creating a new running instance so that perhaps only 80% of request tests have to be under 200ms, instead of 100% of tests (out of the last 20 tests, at 5 second intervals), and at lower revenue returns raise the threshold to 400ms responses. These tolerances and trends could also be saved in a time series to compare against trended data on user sign-ups and unsubscribes.
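
A rough sketch of that kind of policy, using the 200ms/400ms and 80%-of-20-tests numbers from the example (the revenue breakpoints and function names are made up for illustration):

```python
def latency_threshold_ms(revenue_per_hour, low_revenue=50.0, high_revenue=500.0):
    """Slide the acceptable response time between 200ms and 400ms.

    High ad revenue -> demand 200ms; low revenue -> tolerate 400ms.
    """
    if revenue_per_hour >= high_revenue:
        return 200.0
    if revenue_per_hour <= low_revenue:
        return 400.0
    fraction = (revenue_per_hour - low_revenue) / (high_revenue - low_revenue)
    return 400.0 - fraction * 200.0

def should_add_instance(last_20_response_ms, revenue_per_hour, required_fraction=0.8):
    """Add capacity only when too few recent tests meet the current threshold."""
    threshold = latency_threshold_ms(revenue_per_hour)
    passing = sum(1 for ms in last_20_response_ms if ms < threshold)
    return passing / len(last_20_response_ms) < required_fraction

samples = [180, 220, 250, 190, 210] * 4          # 20 tests at 5-second intervals
print(latency_threshold_ms(300.0))               # somewhere between 200 and 400
print(should_add_instance(samples, 300.0))       # False -- enough tests pass
```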

If the slow responses are causing user sign-ups to decrease in a way that impacts their long term goals, that could be factored into the cost allowed to keep response times low. This could be balanced against the population of the targeted regions of the world where they make most of their revenue, keeping a sliding scale between 200ms and 400ms depending on the percentage of the target population currently using the website, weighted along with the ad revenues.

This same method can work for deeper analysis of operating failures, such as choosing the best machine to be selected as the next database replication master, based on historical data about it being under load, keeping up with its updates, and its network latency to its neighbor nodes. Doing this extra analysis could avoid selecting a new master that has connectivity problems to several other nodes, creating a less stable solution than if the network problems had been taken into account.

By looking at monitoring, graphing, alerting and automation as a set of continuous tests against the business and operational goals, your organization’s automation can change from a passive alerting system into a more active business tool.


Rapid Operations Automation Development (ROAD)

August 4, 2010

I am getting ready to release Red Eye Monitor (REM) to the public for people to start using, and have been working at how to describe it.

It started as a monolithic system for automating system provisioning, configuration, deployments, monitoring and reactions. A full life-cycle automation system.

After implementing it in several different environments, and in several different ways, I learned that while this is a viable method, it is not ideal: there are always more business goals to pursue, there are many gaps to fill in an operations environment, and often people are not ready to think about their operations as a big picture. They’re not ready for an Authoritative System Automation system. It’s too much of a change for their processes and their way of thinking about their operations.

So in response, development started to become more modular and lightweight, and I think the REM system now most closely resembles a development platform, specifically a type of Rapid Application Development (RAD) platform.

However, RADs are designed primarily around GUI applications, or end-user applications. The way one thinks about a RAD and uses one is very different from an operational system, which is not a single application with some data sources, a UI, and some logic tying the two together and doing validation and formatting.

An operational development platform requires a lot of communication devices, ways to collect custom data, and ways to process it into things that are general and can be acted upon, based on business goals that change frequently. Most important, it is a live and running system, in the wild.

A system of development tools, slanted toward automation of live operations systems, that rapidly creates new ways to monitor, configure and deploy data and logic, and to analyze how it’s all working, has a very different feel to it than an application development system like Delphi or Visual Basic did when RADs were last in fashion.

The REM suite of tools and libraries has become something I think of as a Rapid Operations Automation Development system, and over the next few days and weeks I will be releasing documentation and demos explaining all of the pieces that the suite consists of. This post and the previous one on Authoritative System Automation are intended to give background and context to these tools and libraries, as there are a lot of ways to use them together, and when used in conjunction they provide a system for authoritatively provisioning and controlling complex and large systems that scale and provide comprehensive insight into what is going on.

They are an architecture, with some standard libraries that work out of the box, which can be extended in many ways to account for the many goals of businesses. They are bound to concepts, not specific technology, so that as new platforms emerge they can be integrated into an existing system, and wrapped with the same container logic as existing systems.

The current suite of REM tools and libraries is:

procblock: A hybrid data and logic processor. This is a primitive-data sourced tag processor which conditionally returns data, and a hardened execution environment that manages threads and pipe-oriented execution of scripts and shell commands. It will need its own documentation and examples to explain, which should be finished in a few days. Defaults with a YAML backend, making the tag logic look similar to Python. Tags are overloadable so custom data-sourced mini-languages can be created for defining and processing data. Both data and code blocks are treated the same way, and are essentially interchangeable. Time series collection (default: RRDTool) and graphing are also included, as is interval based result caching (running in threads). This has command line parsing functionality to provide state directly from the command line, in addition to the normal Python invocation.

dropSTAR: A threaded web server that runs in a procblock, processes HTTP page requests and RPC code requests via procblocks. Made to conditionally import site configuration, and make it extremely fast to add custom code to one or more machines, specified dynamically by some configuration data. Inclusion of RPC makes all nodes easily networked in communicating their needs or states in any communication topology desired. Pages are constructed, by default, by running a procblock pipe of scripts, and templating the result through a text file. Portlets can be created by embedding recursive requests. Maximum flexibility is left to the developer of the scripts and minimal effort is needed to create a new script. Can cache long-running or fault-likely steps to avoid stalls in requests.

Shared Resources Library: A python library that includes a message queue, a shared lock manager, a shared state manager, a shared counter manager, and a shared connection pool manager (ex: database cursors). This allows very small scripts to be written that communicate and coordinate with other small scripts. Logic should be kept minimal, and through the shared resources enterprise level features can be added. This is not meant as a high-performance system in competition with memcached/Redis or ActiveMQ/RabbitMQ, but as a solid development tool to rapidly create operations automation scripts without adding more operational overhead, which standalone shared resource software requires. All of these are based on thread-safe dictionaries and lists, keeping with data-primitives for maximum development flexibility and simplicity. Expansive logging and wrappers for executing code on a system are also included in this.

Time Series Library: A python library that wraps Time Series requests for a single node. Default implementation backend is RRD, but that is adjustable. Many-node systems will manage file locations and naming and use this library both locally on nodes, and on regional collectors, as desired.
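
As a loose illustration of what such a wrapper might look like with an RRD backend, using the rrdtool command line (the file name, data source definition, and retention here are assumptions, not the library’s actual interface):

```python
import subprocess

def create_series(path, step=5):
    """Create an RRD holding one GAUGE value, keeping about an hour of 5s samples."""
    subprocess.check_call([
        "rrdtool", "create", path, "--step", str(step),
        "DS:value:GAUGE:%d:U:U" % (step * 2),      # heartbeat = 2 steps, no min/max
        "RRA:AVERAGE:0.5:1:720",                   # 720 * 5s = 1 hour of raw samples
    ])

def update_series(path, value):
    """Store a sample at the current time."""
    subprocess.check_call(["rrdtool", "update", path, "N:%s" % value])

create_series("cpu_usage.rrd")
update_series("cpu_usage.rrd", 42.5)
```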

Dot Inspection Library: A small stand-alone library that allows inspection and manipulation of data-primitives via strings. Strings like “var1.var2”, “var1.0”, “var1.-1”, and “var1.-1.var2” would inspect into a data-primitive, using input data from a dictionary with “var1” and “var2” fields defined. Sub-inspections can be done like “var1.(svar1.svar2).var2”, so that inspections can be made dynamic. This allows deep configuration of data manipulation at run-time, stored in data sources, allowing more to be done in procblock, and freeing up real logic code (Python/whatever) to be simpler and more about interacting with the operating system and other services, so that architectural issues and goals can be a separate implementation process from direct interaction logic.
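
A quick sketch of the dotted-path idea (an illustration, not the library’s actual implementation; sub-inspections like “var1.(svar1.svar2).var2” are omitted):

```python
def dot_inspect(data, path):
    """Walk nested dicts/lists with a dotted string like "var1.-1.var2"."""
    current = data
    for part in path.split("."):
        if isinstance(current, (list, tuple)):
            current = current[int(part)]          # "0", "-1" index into sequences
        else:
            current = current[part]               # dict-style key lookup
    return current

data = {"var1": [{"var2": "first"}, {"var2": "last"}]}
print(dot_inspect(data, "var1.0.var2"))    # "first"
print(dot_inspect(data, "var1.-1.var2"))   # "last"
```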

schemagen: A schema generator for a backend data source (default implementation is MySQL). Schema information and relations are specified in data (default: YAML) and can create or update a database or other data source. Implementation allows primary key access into the data source as well, so after the MySQL database has been generated, or updated, requests can be formatted through schemagen to extract or insert data, allowing both specification and interaction wrapping through schemagen.

Mother Brain: This is the schema definition for REM, and is fairly massive in scope, covering all physical and virtual hardware, their connections, OS platform specifications, software package specifications, services, users, monitoring, SLAs and everything else in the REM Authoritative System Automation system.

Utility Computing Library: A wrapper for dealing with utility computing (default: Amazon’s EC2). This wraps provisioning machine instances, disks, floating IPs and all the other resources that are required to create a cloud-like Utility Computing system.

REM scripts: Like the Utility Computing scripts, there are many scripts for managing databases, file systems, services, deployment, monitoring, user lists, pager rotation and many more topics. These integrate into dropSTAR and the Mother Brain to be run via procblock or normal system methods to perform actions and collect data.

These tools and libraries form the foundation of the REM system. REM is intended to be customized and expanded, and ultimately to allow “plans” for automated system administration to be published and shared, creating an open source environment for operations.


Unique Organizational Glue

February 16, 2010

Glue is the most important part of any organization, and the part which is always unique to every organization.  Understanding this, and knowing how to make and apply glue will make the difference between a smooth running organization and an organization that is constantly firefighting and often working against itself.

In more traditional organizations, glue was process.  When you wanted to create cohesiveness between your employees and departments, you needed to create a process, train the players in performing the process, and then have a combination of rewards for following and punishments for not following the processes that glue your organization together.

Human processes will never be removed, because we are humans and will always have needs too subtle and changing, and human, to be automated.

That said, many things in today’s organizations can and should be automated, and this is the glue I spend most of my time making and thinking about.

The thing about organizational glue is that it is always unique to your organization.  You can buy off-the-shelf glue, but you still have to apply it uniquely, take care of it uniquely, and train people how to use it uniquely, because no other business does exactly what your business does, with the exact people and structure your business has.

So whether you are a Buy-Everything-Microsoft-Makes shop or a Build-Everything-Myself-From-Open-Source shop, you are still configuring everything uniquely, to solve your unique problems.

Herein lies the reason more expert operations people choose Linux and other *nix environments: in these environments you are expected to make your own glue.  All the components may come readily available, and many of the glues are already pre-mixed, but you are still expected to figure out where to put them, how to configure them, and probably to write your own custom glue code to connect piece X to piece Y, because they don’t quite line up.

In a pre-packaged environment, much of this has been done for you, many processes have already been worked out, and you are expected to implement them to specification.  Certification programs are created to align workers with the commercial packages they support, to enforce these “best practices”.

The trouble comes when these pre-made glue stamps fail to meet all of your organization’s needs, and then you must create custom glue.  In an environment that expects or requires custom glue, this has a steep learning curve, but is expected and encouraged.  In an environment where everything is supposed to be planned for you to implement, it is very difficult to add your own processes, and agility is lost in exchange for having the majority of your solution come out of a box.

Working around these unique elements, while still working with the system so as not to subvert the benefits you received from purchasing it, creates an extremely difficult situation, made worse by those who are not capable of creating custom solutions or do not believe in them.

The real problem is one of expectations.  As a unique service provider, your business will do some things uniquely.  If you are in an industry where your service is all taken care of by humans, then your internal operations may well be simple enough to use off-the-shelf glue, and it will work well enough.

In the complex and ever changing world of internet software companies, this is not the case, and never has been.  And yet, many people still do not understand that their organization requires custom glue, that their processes will not simply connect together, and that by leveraging the abilities of their senior staff to create custom solutions that mix in with existing open source and purchased solutions, they can find an optimal balance between buying and building, which they never really had a choice between anyway.

You simply can’t buy it all, because no one sells “Your Business In a Box”.  It’s up to you to build your business, and if you do it well, it will seem like it fits in a box.  If you do it poorly, it will seem like a combination of post-Katrina wasteland and a forest fire.

Either way, it pays to understand that your business has unique goals, and that it will take unique glue to bind your employees and departments together to achieve those goals.