REM Monitoring Public Beta

October 27, 2010

Here is the first package for the public beta.

To install this, you will need to have a Linux, FreeBSD or OS X operating system.

You will need python2.6 installed.

Install the PyYaml, jsonlib, setuptools, and Pygments Python packages.

You should also have rrdtool and the snmpwalk binary (from Net-SNMP) installed on your system.

To install, unzip (sorry about OS X cruft), cd into the main directory, and you will find:

dropstar                jquery_drop_widget      procblock               rem_monitor             unidist

$ cd unidist ; sudo python2.6 setup.py install

$ cd ../procblock/ ; sudo python2.6 setup.py install

$ cd ../dropstar/ ; sudo python2.6 setup.py install

$ cd ../jquery_drop_widget/ ; sudo python2.6 setup.py install

These are also available from PyPI, if you wish to install them via easy_install.

To run the application:

$ cd ../rem_monitor/

$ ./procblock monitor.yaml

Then point your browser to: http://localhost:2001/system_monitor/monitor

Substitute the hostname for localhost, if connecting remotely.

That’s it.  A proper installer for CentOS, Fedora and Debian, and usage documentation, will be forthcoming in the next few days.


REM Monitoring Talk at Yahoo this Wednesday

October 26, 2010

I’m giving a talk on the REM Monitoring system at the Large-Scale Production Engineering meet-up at Yahoo this Wednesday the 27th.

It will be the first public talk about the REM Monitoring system, and will mark the official Public Beta.  I hope to keep the beta short, about 3-4 weeks, after which the first full release can be made.  From then on the monitoring system itself will stabilize and just get additional monitoring plugins and such.

If you want to take a look at the slides, you can download them here.


REM Monitoring Demo

September 29, 2010

The first REM Monitor demo is now up:

http://ge01f.com/test/system_monitor/monitor

This demo shows monitoring of remote hosts via ping and HTTP.  Host monitoring is turned off because graphing has not been optimized yet: with hundreds of graphs to render, it takes up a majority of the machine’s resources.

Hosts can be added or removed, and new HTTP alerts can be added.

More to come…  SNMP host monitoring, and SLA-based alerts to role accounts with shifts and prioritized contact lists, with a delay between escalations pending alert acknowledgment.  Among other things.

You can see the old Local System Monitoring Demo, showing a set of host monitors, here:

http://ge01f.com/sys/system_monitor/


Another Demo: No Relay Chat

August 25, 2010

As another demonstration of how the unidist library can be put to use, primarily using shared state and message queues, I wrote an IRC-like web based chat server named No Relay Chat.

You can download the No Relay Chat Demo here, at the procblock package’s Google Code page.  You’ll also need the latest procblock, which is on that page.

Below is a screen shot, but here is an online demo. Log in and make some channels.

There is a bug that appears intermittently with the New Channel pop-up dialog; close it and try again, or reload the page, if you’re still interested at that point.  Closing a channel also intermittently seems not to unsubscribe.  I may be missing a JS error in some case, or there could be a race condition that sets channels again after the unsubscribe.  The communication itself has always been solid and reliable, so I’m guessing it’s something like this.  I’ll clean those up over the next few days.

At a later date I’ll wrap this up into a full page with IM, ops, moderation, invite/secret channels, and other IRC goodness so it is more useful, and then I’ll package it up for easy installation via RPM/MSI/etc.


dropSTAR (webserver) procblock demo

August 18, 2010

Now for a demo with a bit more teeth.  This will soon be released as a standalone open source web server package, named dropSTAR (Scripts, Templates and RPC).  It is designed to make it easy to drop in new dynamic pages, and is focused on system automation instead of the standard end-user applications that normal dynamic web servers are intended to serve.  It can do that kind of work too, but it’s not optimized for millions of page loads.  It’s optimized to take as little time as possible to make and modify web pages for system administration use, and to provide RPC for dynamic pages or system automation between nodes or processes.

The demo can be downloaded at the procblock downloads page.  You will need to install procblock as well, which is also available at that link.

A live version of the demo can be played with here.

The demo is a series of tabs, each doing something different:

CPU Monitoring

This tab shows a very simple CPU statistics collection program (runs a Shell command, parses and returns dict of fields), which runs in an Interval RunThread every 5 seconds, and then graphs the results.  The page automatically reloads the graph every 5 seconds so it can be watched interactively.
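As a rough illustration of that collect-on-an-interval pattern, here is a small Python 3 sketch using only the standard library. It is not the demo's actual code: `parse_cpu_line`, `collect_cpu` and `IntervalRunner` are made-up names, and reading the first line of /proc/stat is an assumption about what the shell command might look like on Linux.

```python
import subprocess
import threading


def parse_cpu_line(line):
    """Parse a 'cpu user nice system idle ...' line (the /proc/stat layout) into a dict."""
    fields = line.split()
    names = ["user", "nice", "system", "idle"]
    return {name: int(value) for name, value in zip(names, fields[1:5])}


def collect_cpu(command=("head", "-n", "1", "/proc/stat")):
    """Run a shell command and return its first output line as a dict (Linux only)."""
    output = subprocess.run(command, capture_output=True, text=True, check=True).stdout
    return parse_cpu_line(output.splitlines()[0])


class IntervalRunner(threading.Thread):
    """Hypothetical stand-in for an interval run thread: calls func every `interval` seconds."""

    def __init__(self, func, interval=5.0):
        super().__init__(daemon=True)
        self.func = func
        self.interval = interval
        self.results = []  # collected samples, ready to be graphed
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            self.results.append(self.func())
            # wait() returns early if stop() is called during the pause
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()
```

Something like `IntervalRunner(collect_cpu, interval=5).start()` would then fill `results` with one sample every 5 seconds for a graphing step to pick up.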

System Information

This is another very simple shell command that cats /proc/cpu and puts the columns into an HTML table.

Logs

This tab reads out of the tail of a log file, reverses the lines, and splits the contents to format for HTML in color.  It updates every 5 seconds.
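A minimal Python 3 sketch of the tail, reverse and colorize steps follows. The function names and the level-to-color mapping are assumptions for illustration; the demo's own formatting rules are not described here.

```python
import html
from collections import deque

# Hypothetical color scheme keyed on a word found in the log line.
LEVEL_COLORS = {"ERROR": "red", "WARNING": "orange", "INFO": "green"}


def tail_lines(path, count=20):
    """Return the last `count` lines of a file, oldest first."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def format_log_html(lines):
    """Render lines newest first, one colored <div> per line."""
    rows = []
    for line in reversed(lines):
        level = next((name for name in LEVEL_COLORS if name in line), None)
        color = LEVEL_COLORS.get(level, "black")
        rows.append('<div style="color: %s">%s</div>' % (color, html.escape(line)))
    return "\n".join(rows)
```

Escaping each line before wrapping it keeps stray `<` or `&` characters in the log from breaking the page.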

Requests

This is the most complex tab on the page.  It has a monitor for “requests”, which is a counter in the shared resources (unidist.sharedcounter module), and a thread runs with a delay and increments the requests.  The total number of requests since starting is shown in text, and the graph displays the change in this request variable over time.

A slider allows adjustment of the delay for requests, which will be saved in shared state (unidist.sharestate module).  Reloading the page will keep the slider in any changed position, and the graph/counter should correlate to the position of the slider in terms of more or fewer requests per second.

There is also a button called “Stop Request Poller” which acquires a lock (unidist.sharedlock module), which stops the poller from incrementing the request counter.  If toggled again, requests will resume.
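The interplay of counter, delay and lock can be sketched in plain Python 3. This is not the unidist API: `RequestPoller` and its attributes are hypothetical stand-ins built on the standard `threading` module, just to show how holding a lock can pause a poller.

```python
import threading


class RequestPoller:
    """Stand-in for the demo's poller; the real one uses unidist shared resources."""

    def __init__(self, delay=1.0):
        self.state = {"delay": delay}       # plays the role of the shared state
        self.requests = 0                   # plays the role of the shared counter
        self.pause_lock = threading.Lock()  # plays the role of the shared lock

    def tick(self):
        """Increment the counter unless the pause lock is currently held."""
        if self.pause_lock.acquire(blocking=False):
            try:
                self.requests += 1
            finally:
                self.pause_lock.release()
            return True
        return False

    def toggle_pause(self):
        """Acquire the lock to stop the poller, or release it to resume."""
        if self.pause_lock.locked():
            self.pause_lock.release()
            return False
        self.pause_lock.acquire()
        return True

    def run(self, stop_event):
        """Poll until stop_event is set, re-reading the delay on every pass."""
        while not stop_event.is_set():
            self.tick()
            stop_event.wait(self.state["delay"])
```

Because `run` reads `state["delay"]` on each pass, a slider that writes a new delay takes effect on the next iteration without restarting the thread.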

The bottom right side of the page has not been completed yet, so it is just there to look pretty and take up page space.  Later this will turn into an adjustable SLA monitor which will notify or alert (via the HTML page) that the SLA is near or out of tolerance with regard to requests per second.

Wall

This page shows the use of the message queues (unidist.messagequeue module), which allow messages to be inserted into a queue for later processing.  Any message typed into the input field with an enter key or Write button click will be inserted into the message queue “wall” in the shared message queues.  Any messages older than the last 25 are discarded to keep it only storing useful data.  Messages are not removed from the queue on reading, so that they can be continually re-processed for display.

Then in 5 seconds an RPC call will update the page with all the messages.
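The queue behavior described above (keep only the newest 25 messages, and read without removing) can be approximated with a bounded deque. This is a Python 3 sketch with a made-up `WallQueue` class, not the unidist.messagequeue interface.

```python
from collections import deque


class WallQueue:
    """Bounded message store: old messages fall off, reads do not consume."""

    def __init__(self, limit=25):
        self._messages = deque(maxlen=limit)

    def add(self, text):
        # When the deque is full, appending silently drops the oldest entry.
        self._messages.append(text)

    def read_all(self):
        # Return a copy so the stored messages can be re-read on every refresh.
        return list(self._messages)
```

Each page refresh can call `read_all()` and render the same messages again, since nothing is popped from the queue.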


Continuous Testing

August 4, 2010

Today’s system and network monitoring primarily consists of collecting counters and usage percentages and alerting if they are too high or low. More comprehensive analysis can be performed by using the same troubleshooting logic that a human would apply on encountering a system condition, such as a resource being out of limits (e.g. CPU at 90% utilization). Encapsulating this logic along with the alert condition specification allows automation to follow the same procedures that would ideally be carried out by the human on duty.

Data that is constantly aggregated across services into new time series can be analyzed and re-processed in the same way as the originally collected data, to create more in-depth alerting or reaction conditions, or to create even higher levels of insight into the operations of the system.

The key is to create the process exactly as the best human detective would do it, because the processes need to map directly to your business and organizational goals. This mapping is easiest to keep consistent over many updates if it is modeled after an idealized set of values for solving the business goals.

For example, a web server system could be running for a media website, which makes its money on advertising. They have access to their advertising revenue and ad hit rates through their provider, and can track the amount of money they are making (a counter data type), up to the last 5 minutes. They want to keep their servers running fast, but their profit margins are not high, so they need to keep their costs minimal. (I’m avoiding an enterprise scale example in order to keep the background situation concise.)

To create a continuous testing system to meet this organization’s needs, a series of monitors can be set up to collect data about all relevant data points, such as using an API to collect from the advertising vendor, or scraping their web page if an API isn’t available. Hourly cost (a counter data type) and the count of running machine instances (a gauge data type) can be tracked to provide insight into the current costs, to compare against advertising revenues.
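To make the counter versus gauge distinction concrete: a gauge sample can be graphed as it is, while a counter only becomes useful once consecutive samples are turned into a rate. A small Python 3 sketch, with a hypothetical `counter_to_rates` helper and made-up sample values:

```python
def counter_to_rates(samples):
    """Turn (timestamp_seconds, counter_value) samples into per-second rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # The change in the counter divided by the elapsed time is the rate.
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Revenue counter read every 5 minutes (illustrative numbers only).
revenue_samples = [(0, 100.0), (300, 130.0), (600, 190.0)]
revenue_per_second = counter_to_rates(revenue_samples)
```

The resulting rates form a new time series that can be stored and compared against the cost series in the same units.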

In addition to tracking financial information, system information, such as the number of milliseconds it takes their web servers to deliver a request (a gauge data type), can be stored. They have historical data that says that they get more readers requesting pages and make more money when their servers respond to requests faster, so having the lowest response times on the servers is a primary goal.

However, more servers cost more, and advertising rates fluctuate constantly. At times ads sell for more, at times for less, and sometimes only default ads are shown, which pay nothing. During these periods the company can lose money by keeping enough machine instances running to keep their web servers responding faster than 200ms at all times.

By doing another level of analysis on the time series data for the incoming ad revenue, the costs of the currently running instances and the current web server response times, an algorithm can be developed that picks the response which maximizes revenue during periods of both high and low advertising revenue. This can change the response-time condition for creating a new running instance so that perhaps only 80% of request tests have to be under 200ms, instead of 100% of tests (out of the last 20 tests, at 5 second intervals), and at lower revenue returns raise the threshold to 400ms responses. These values for tolerances and trends could also be saved in a time series to compare against trended data on user sign-ups and unsubscribes.
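One way such a rule could look in code is sketched below in Python 3. The function names and the revenue cutoff are assumptions; only the 80%, 20-test, 200ms and 400ms figures come from the example above.

```python
def needs_new_instance(response_times_ms, threshold_ms=200,
                       required_fraction=0.8, window=20):
    """True when too few of the last `window` tests beat the threshold."""
    recent = response_times_ms[-window:]
    if not recent:
        return False  # no data yet, so no basis for scaling up
    passing = sum(1 for ms in recent if ms < threshold_ms)
    return passing / len(recent) < required_fraction


def threshold_for_revenue(revenue_per_hour, low_revenue_cutoff):
    """Relax the threshold to 400ms when revenue is below a hypothetical cutoff."""
    return 400 if revenue_per_hour < low_revenue_cutoff else 200
```

With 15 of the last 20 tests under 200ms (75%), `needs_new_instance` asks for another instance; if revenue drops and the threshold relaxes to 400ms, the same samples pass and no instance is added.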

If the slow responses are causing user sign-ups to decrease in a way that impacts their long term goals, that could be factored into the cost allowed to keep response times low. This could be balanced against the population in the targeted regions of the world where they make most of their revenue, so they keep a sliding scale between 200ms and 400ms depending on the percentage of their target population that is currently using their website, weighted along with the ad revenues.
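A sliding scale like this could be a simple linear interpolation. The Python 3 sketch below assumes both inputs are already normalized to the range 0 to 1, and the equal weighting between audience and revenue is an arbitrary choice for illustration.

```python
def sliding_threshold_ms(target_population_share, revenue_level,
                         revenue_weight=0.5, fast_ms=200, slow_ms=400):
    """Blend audience and revenue into a threshold between fast_ms and slow_ms.

    Both inputs are expected in [0, 1]; higher demand pulls the threshold
    toward fast_ms, lower demand lets it drift toward slow_ms.
    """
    demand = ((1 - revenue_weight) * target_population_share
              + revenue_weight * revenue_level)
    demand = min(1.0, max(0.0, demand))  # clamp out-of-range inputs
    return slow_ms - demand * (slow_ms - fast_ms)
```

The returned value can be fed into whatever rule decides when to start another instance, and stored as its own time series for later comparison.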

This same method can work for deeper analysis of operating failures, such as selecting the best machine to be the next database replication master, based on historical data about how well it keeps up with its updates under load and its network latency to its neighbor nodes. Doing this extra analysis could avoid selecting a new master that has connectivity problems to several other nodes, which would create a less stable solution than if the network problems had been taken into account.
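As a sketch of that selection, the Python 3 function below ranks candidates by their worst latency to any peer first and by average replication lag second. The data layout, the host names and the ranking order are all assumptions made for this example.

```python
def pick_master(candidates):
    """Pick the candidate with the best connectivity, then the lowest lag.

    candidates maps a host name to a dict with:
      "lag_seconds":     historical replication lag samples
      "peer_latency_ms": latency to each neighbor node
    """
    def score(name):
        stats = candidates[name]
        mean_lag = sum(stats["lag_seconds"]) / len(stats["lag_seconds"])
        worst_latency = max(stats["peer_latency_ms"].values())
        return (worst_latency, mean_lag)  # lower is better on both counts

    return min(candidates, key=score)


history = {
    "db1": {"lag_seconds": [0.1, 0.2], "peer_latency_ms": {"db2": 2, "db3": 250}},
    "db2": {"lag_seconds": [0.5, 0.7], "peer_latency_ms": {"db1": 2, "db3": 3}},
}
chosen = pick_master(history)
```

Here db1 has the lower lag but a poor link to db3, so the ranking passes it over in favor of db2.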

By looking at monitoring, graphing, alerting and automation as a set of continuous tests against the business and operational goals, your organization’s automation can change from a passive alerting system into a more active business tool.


Authoritative System Automation

August 4, 2010

As I’ve been developing the Red Eye Monitor (REM) system, I have spent a lot of time thinking about how to think about automation. How to talk about it, how to explain it, how to do it; all of it is hard because there are more pieces to system automation than can be kept in our brain’s symbolic lookup table at one time. I believe most people will top out at 3-4 separate competing symbols and there are somewhere between 10 and 20 major areas of concern in system automation, making it very difficult to keep things in mind enough to come up with a big picture that remains comprehensive as the detail magnification levels go up and down.

I had initially phrased REM as a “Total Systems Automation” tool, because it was meant to be comprehensive. Comprehensive automation is a fairly new idea; even though many people have been working on it for a long time, it just hasn’t really caught on yet.

Having attempted to explain what a comprehensive automation system consists of, and what benefits it has, I have now settled on a new, shinier and better name: Authoritative System Automation (ASA).

Why Authoritative System Automation?

During my many discussions with people about a comprehensive automation system, I have settled on the single most distinguishing element between this kind of system and other kinds of automation: there is only ONE concept of what the system should look like, the authoritative information that defines what the ideal system should be.

All benefits of an ASA system come from the fact that, with only ONE blueprint for the ideal system, all actual details (monitored and collected from reality) are compared to the ideal blueprint, and then actions are taken to remedy any differences. This could mean deploying new code or configuration data to a now-out-of-date system after a deployment request, restarting a service that is performing below its service level agreement (SLA), provisioning a new machine and decommissioning one that fails critical tests, or paging a human to close the loop on a problem that automation is not capable of solving or prepared to solve.
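The compare-and-remedy loop can be reduced to a diff between two dictionaries. The Python 3 sketch below is only an illustration of the idea, not REM's implementation; `plan_actions`, the keys and the action names are all hypothetical.

```python
def plan_actions(ideal, actual):
    """Compare the ideal blueprint with observed reality and list remedies."""
    actions = []
    for key, wanted in sorted(ideal.items()):
        if actual.get(key) != wanted:
            # Reality differs from the blueprint: push the ideal value out.
            actions.append(("set", key, wanted))
    for key in sorted(set(actual) - set(ideal)):
        # Something exists that the blueprint does not know about.
        actions.append(("remove", key, None))
    return actions


ideal = {"app_version": "1.2", "httpd": "running"}
actual = {"app_version": "1.1", "httpd": "running", "telnetd": "running"}
remedies = plan_actions(ideal, actual)
```

The same diff that deploys a new version also catches an unexpected service, because both are simply differences from the single blueprint.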

Authoritative is a superset of comprehensive, because all aspects of the system come from the authoritative data source. Authoritative data sources can only have 1 master. Systems could be split so that there are several systems that run independently, but to be an authoritative system, each of them has to respond only to the requirements as detailed by the authoritative data source that backs the authoritative system automation.

Authoritative System Automation is vastly different from typical automation practiced today, even if that automation is “comprehensive”. Any system where data resides in more than one place cannot be an authoritative system, and will need to be updated in multiple places for maintenance. Upgrades will need to track multiple places, and mismatches can occur where data is out of date in one system compared to another.

All of this reduces confidence in automation, and means that the system cannot truly be automated. It can only be mostly-automated, which is vastly different.

An Authoritative Automated System will never require or allow an independent change. Unless the piece being changed has been removed or frozen from automation, the ASA will revert the change or otherwise take corrective action, so the independent change is essentially an attack against the ASA. A properly configured and complete ASA will correct the change, and the system will continue.

Another major difference between an ASA and a non-ASA automated system is that an ASA will have EVERY piece of data stored in its authoritative data source. No configuration data will be stored in a text file: if it is configuration data, it is in the ASA data source, and it is mixed with templates to become a working configuration file or data when deployed to a system.
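A tiny Python 3 sketch of data plus template becoming a configuration file, using the standard `string.Template`. The host name, the keys and the template text are invented for illustration, and a dict stands in for whatever the authoritative data source really is.

```python
from string import Template

# Stand-in for the authoritative data source (hypothetical host and values).
AUTHORITATIVE_DATA = {
    "web01": {"listen_port": 8080, "max_clients": 150},
}

# The template holds only structure; every value comes from the data source.
HTTPD_TEMPLATE = Template("Listen $listen_port\nMaxClients $max_clients\n")


def render_config(host):
    """Merge a host's authoritative data into the template text."""
    return HTTPD_TEMPLATE.substitute(AUTHORITATIVE_DATA[host])
```

`substitute` raises `KeyError` when the data source lacks a value the template needs, which is the useful failure here: a missing value stops the deployment instead of being filled in by hand.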

This is a major change from most automation attempts, as they merge automated scripts, databases with information for the scripts to operate on, and configuration text and other data sources together, to create a fabric of automation.

An ASA cannot have a single piece of configuration data outside of the ASA data source, or it is no longer authoritative. If the non-ASA attempted to reconfigure something, a side effect can crop up between the new “authoritative” data update and some other source of automation or configuration data.

An ASA system must be built 100% from its authoritative data, and must be maintained 100% from its authoritative data. Any deviation, manual configuration or data embedded anywhere, breaches the goal of an ASA and creates a non-ASA.

I see Authoritative System Automation as the highest form of automation, and the only comprehensive option, as it is completely data based, and logic serves only to push the data out to reality, and to collect from reality to compare against the ideal. The benefit of an ASA system is confidence that the system is doing the right thing; it is testable and verifiable because at every stage in the automation, both the intent and the expected result are known and are the same.