Local System Monitoring Demo

August 22, 2010

I have the first draft of the local system monitoring demo (single node) ready: It can be viewed here.

I’ll be flushing this out more after I finish the monitoring for Linux, and fix the Disk I/O to update properly in FreeBSD and OS X, and fix the View Internals for the RRDs, on RRDs that have multiple targets per type.  Then I’ll add some formatting for the sections, and make the list of items dynamic and so you can turn uninteresting ones off, and then I will ship that demo.  After a few more demos to finish testing out all the different packages that it takes to make up Red Eye Monitor (REM), then I will turn this into a real monitoring software install, that does good things out of the box, and works on single nodes or multiple nodes.


dropSTAR released as Python library

August 19, 2010

dropSTAR has now been released as a stand-alone Python library for creating an HTTP server.

I’ll be putting together RPM/Deb/make packages to get a more functional install for those not interested in the packages, but the functionality.  These will come with installers for modules on the dropSTAR and procblock platform, which will allow services to be packages and downloaded separately and will stay focused on providing functionality, not a lower level development framework.

More to come!


dropSTAR (webserver) procblock demo

August 18, 2010

Now for a demo with a bit more teeth.  This will soon be released as a stand alone open source web server package, named dropSTAR (Scripts, Templates and RPC).  It is designed to easily allow dropping in new dynamic pages, and is focused on system automation instead of the standard end-user applications that normal dynamic web servers are intended to serve.  It can do that kind of work too, but it’s not optimized for millions of page loads, it’s optimized to take as little time as possible to make and modify web pages for system administration use, and RPC for dynamic pages or system automation between nodes or processes.

The demo can be downloaded at the procblock downloads page.  You will need to install procblock as well, which is also available at that link.

A live version of the demo can be played with here.

The demo is a series of tabs, each doing something different:

CPU Monitoring

This tab shows a very simple CPU statistics collection program (runs a Shell command, parses and returns dict of fields), which runs in an Interval RunThread every 5 seconds, and then graphs the results.  The page automatically reloads the graph every 5 seconds so it can be watched interactively.

System Information

This is another very simple shell command, that cats /proc/cpu and puts the columns into an HTML table.

Logs

This tab reads out of the tail of a log file, reverses the lines, and splits the contents to format for HTML in color.  It updates every 5 seconds.

Requests

This is the most complex tab on the page.  It has a monitor for “requests”, which is a counter in the shared resources (unidist.sharedcounter module), and a thread will run with a delay and increment the requests.  The total number of requests since starting are showing in text, and the graph displays the change in this request variable over time.

A slider allows adjustment of the delay for requests, which will be saved in shared state (unidist.sharestate module).  Reloading the page will keep the slider in any changed position, and the graph/counter should correlate to the position of the slider in terms of more or less requests per second.

These is also a button called “Stop Request Poller” which Acquires a lock (unidist.sharedlock module), which stops the poller from incrementing the request counter.  If toggled again, requests will resume.

The right bottom side of the page has not been completed yet, and so just is there to look pretty and take up page space.  Later this will turn into an adjustable SLA monitor which will notify or alert (via the HTML page) that the SLA is near or out of tolerance with regard to requests a second.

Wall

This page shows the use of the message queues (unidist.messagequeue module), which allow messages to be inserted into a queue for later processing.  Any message typed into the input field with an enter key or Write button click will be inserted into the message queue “wall” in the shared message queues.  Any messages older than the last 25 are discarded to keep it only storing useful data.  Messages are not removed from the queue on reading, so that they can be continually re-processed for display.

Then in 5 seconds an RPC call will update the page with all the messages.


Simplest procblock Demo

August 18, 2010

This is the simplest demo I could think of, it features one script that is run by procblock, specified by a YAML file: simplest.yaml:

run:
 - script: simplest.py

Which runs simplest.py:

import random

def ProcessBlock(pipe_data, block, request_state, input_data, tag=None, cwd=None, env=None, block_parent=None):
 """Simplest demo possible."""
 data = {'random':random.randint(0, 100)}
 return data

To invoke procblock, run:

cd demo_simplest
./procblock simplest.yaml

This will invoke the script “simplest.py”, which has a standard module function ProcessBlock().  This is the standardized method for all code process blocks, and allows them to chain together, passing relevant information, and also shared data between them.  In this case, there is only one script, so it is simply returning it’s result.

Example:

monkey:demo_simplest ghowland$ ./procblock simplest.yaml 2> /dev/null
{'run': {'__duration': 0.010447025299072266,
 '__start_time': 1282114491.079386,
 'args': [],
 'random': 99}}
monkey:demo_simplest ghowland$

I redirected STDERR to /dev/null because I am leaving the STDERR logging on until procblock is ready to leave Beta Testing.

For a second example there is a slightly more complicated procblock called: simplest_monitor.yaml

As the name implies, this is the simplest monitor I could think of.  It monitors random numbers simplest.py generates every 5 seconds, stored them in an RRD file, and then graphs it.

Because it runs the script every 5 seconds, it is labled a “long running” process, and so it will continue to run until CTRL-C is pressed, which will send a notification for all threads to exit gracefully, which they will do if they are properly written.

Here is the contents of simplest_monitor.yaml:

run:
 - script: simplest.py
   cache: 5
   thread_id: simplest
   timeseries collect:
     path: simplest.rrd
     interval: 5

     fields:
       random:
         type: GAUGE

     graph:
       - path: simplest.png
         title: Simplest Monitoring Demo
         fields: [random]
         method: STACK
         interval: 10
         vertical label: "Random #s"

__usage:
  name: simplest
  author: Geoff Howland

  # Let this run, so we can monitor it
  longrunning: true

The big additions here are the “timeseries collect” statement, which defines what fields to collect from the run script, and how to graph it, and the addition of a __usage section to the YAML file, which defines the name of the block (simplest), the author, and sets longrunning=True, so that the script wont quit as soon as the thread is created to monitor simplest.py’s results.

Example 2:

Run:

monkey:demo_simplest ghowland$ ./procblock simplest_monitor.yaml 2> /dev/null
Running Thread: Starting: simplest
Waiting for interval thread output: simplest
{'run': {'__duration': 0.41039705276489258,
 '__start_time': 1282114743.028677,
 'random': 6,
 'run_thread.simplest': simplest: Is Running: True  Scripts: ['simplest.py']}}
^CRunning Thread: Quitting: simplest
monkey:demo_simplest ghowland$

The result, after a bit of waiting:


What happens this time is a bit different.  Right away we get a returned object, that shows the duration being quite short, and a single random number, and a field called “run_thread.simplest”.  “simplest” is the thread_id name I gave to this monitor thread.  The value of this is a __repr__() string representation from a RunThread object, and you can see it is running (Is Running: True) and running scripts [simplest.py].  It is also formatted in HTML, because that is where I am using it in testing web server internals right now.  Another Beta artifact.

Nothing else happens in this script, until I press CTRL-C, and it quits.  That is because I have the logging turned off.  With the logging on, it shows what is going on:

monkey:demo_simplest ghowland$ ./procblock simplest_monitor.yaml
DEBUG:20100818000408:procyaml.py:138:ImportYaml: Importing YAML: simplest_monitor.yaml
DEBUG:20100818000408:mainfunctions.py:434:ProcessAndLoop: Long Running Process: Starting...  (CWD: /Users/ghowland/blocks/demo_simplest)
DEBUG:20100818000408:procyaml.py:138:ImportYaml: Importing YAML: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/procblock-201008.1-py2.6.egg/procblock/data/default_tag_functions.yaml
DEBUG:20100818000408:procyaml.py:138:ImportYaml: Importing YAML: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/procblock-201008.1-py2.6.egg/procblock/data/default_condition_functions.yaml
DEBUG:20100818000408:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115048.28: {'random': 67}
DEBUG:20100818000408:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
{'run': {'__duration': 0.4735870361328125,
 '__start_time': 1282115048.26404,
 'random': 67,
 'run_thread.simplest': simplest: Is Running: True  Scripts: ['simplest.py']}}
DEBUG:20100818000413:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115053.58: {'random': 93}
DEBUG:20100818000413:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000418:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115058.84: {'random': 85}
DEBUG:20100818000418:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000424:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115064.14: {'random': 57}
DEBUG:20100818000424:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000429:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115069.48: {'random': 79}
DEBUG:20100818000429:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000434:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115074.81: {'random': 32}
DEBUG:20100818000434:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
^CDEBUG:20100818000435:mainfunctions.py:451:ProcessAndLoop: ProcessAndLoop: Keyboard Interrupt: Releasing lock: __running
DEBUG:20100818000435:mainfunctions.py:470:ProcessAndLoop: Quitting...
Running Thread: Quitting: simplest
monkey:demo_simplest ghowland$

The first 4 lines show procblock starting up, all YAML files that are loaded are logged, to be able to trace what it is doing.  Then the first run of simplest.py is made, and returned in the “run” tag result, along with the RunThread object which is still running in a thread.

Then every 5 seconds, the simplest.py:ProcessBlock() is invoked, and the result ({‘random’:99}) is stored in simplest.rrd, and then simplest.png is graphed.

Then I hit the CTRL-C key and this caused a shared lock called __running to be released, and the thread that was running simplest.py every 5 seconds quit the next time it was invoked, releasing the process and procblock is finished.


Launching in progress…

August 17, 2010

I’ve started launching Red Eye Mon, but it’s going to be a long road. Currently I have packaged up two of the core components:

  • procblock: a logic and data tag processor and hardended execution environment
  • unidist: Unified Distributed Computing for Python: Message Queues, Locks, Counters, State, Logs and Time Series

unidist is the base library, providing a bunch of Distributed Computing mechanisms, like locks, message queues, counters, state and logging.  This allows programs to share information easily, and in a some different ways to made using the shared data easy and reliable.

procblock is the work horse of the system.  It is a method to abstract the Architecture of a system of scripts, so that the scripts themselves remain simple, yet they are combined together to do complete things.

procblock is a hardened execution environment, which handles thread and process management in a variety of ways.  procblock is also a conditional tag processor, which can conditionally return data.

Mixing the tag processor functionality with the hardened execution environment functionality delivers some interesting results.

problock:

unidist:

A demo of a web server (dropSTAR) running via procblock, using the unidist tools, and doing graphing, using message queues, locks and counters, with code and data exposed, can be found here.

Next, I will post documentation of the unidist library, add unit tests, and start to release demos that can be downloaded with examples and instructions on how to use.

One of these demos will be the dropSTAR (scripts, templates and RPC) webserver, which runs via procblock, and is configured with procblock, and whose pages are rendered via procblock.  dropSTAR is another important component to the Red Eye Mon (REM) system, as it provides insight into and communication between nodes.

After procblock, unidist and dropSTAR releases are complete, I will release schemagen and the Mother Brain schema for Red Eye Mon, which will be the control database for managing automated systems, and then I will start to release all the ported sets of scripts for system management, deployment, monitoring and cloud management.


Rapid Operations Automation Development (ROAD)

August 4, 2010

I am getting ready to release Red Eye Monitor (REM) to the public for people to start using, and have been working at how to describe it.

It started as a monolithic system for automating system provisioning, configuration, deployments, monitoring and reactions. A full life-cycle automation system.

After implementing it in several different environments, and several different ways, I learned that while this is a viable method, it is not very ideal as it there are always more business goals that need to be pursued and there are many gaps that need to be filled in an operations environment and often people are not ready to think about their operations as a big picture. They’re not ready for an Authoritative System Automation system. It’s too much of a change for their processes and their way of thinking about their operations.

So in response development started to change to become more modular and light weight, and I think the REM system now most closely resembles a development platform, and specifically a type of Rapid Application Development (RAD) platform.

However, RADs are designed primarily around GUI applications, or end-user applications. The way that one thinks about a RAD and how one uses one is very different than an operational system, which is not a single application with some data sources and a UI and some logic tying the two together and doing validation and formatting.

An operational development platform requires a lot of communication devices, and ways to collect up custom data, and to process it into things that are general and can be acted upon, based on business goals that change frequently, and most important, it is a live and running system, in the wild.

A system of development tools, slanted for automation of live operations systems, to rapidly create new ways to monitor, configure and deploy data and logic and analyze how it’s all working has a very different feel to it than an application development system like Delphi or Visual Basic were when RADs were last in fad.

The REM suite of tools and libraries has become something I think of as a Rapid Operations Automation Development system, and over the next few days and weeks I will be releasing documentation and demos explaining all of the pieces that the suite consists of. This post and the previous one on Authoritative System Automation are intended to give background and context to these tools and libraries, as there are a lot of ways to use them together, and when used in conjunction they provide a system for authoritatively provisioning and controlling complex and large systems that scale and provide comprehensive insight into what is going on.

They are an architecture, with some standard libraries for working out of the box, which can be extended in many ways to account for the many goals of businesses and are bound to concepts, not specific technology, so that as new platforms emerge they can be integrated into an existing system, and wrapped with the same container logic as existing systems.

The current suite of REM tools and libraries is:

procblock: A hybrid data and logic processor. This is a primitive-data sourced tag processor which conditional returns data and is a hardened execution environment that manages threads and pipe-oriented-execution of scripts and shall commands. It will need it’s own documentation and examples to explain, which should be finished in a few days. Defaults with a YAML backend, making the tag logic look similar to Python. Tags are overloadable to custom data sourced mini-languages can be created for defining and processing data. Both data and code blocks are treated the same way, and are essentially interchangeable. Time series collection (default: RRDTool) and graphing are also included, as is interval based result caching (running in threads). This has command line parsing functionality to provide state directly from the command line, in addition to the normal Python invocation.

dropSTAR: A threaded web server that runs in a procblock, processes HTTP page requests and RPC code requests via procblocks. Made to conditionally import site configuration, and make it extremely fast to add custom code to one or more machines, specified dynamically by some configuration data. Inclusion of RPC makes all nodes easily networked in communicating their needs or states in any communication topology desired. Pages are constructed, by default, by running a procblock pipe of scripts, and templating the result through a text file. Portlets can be created by embedding recursive requests. Maximum flexibility is left to the developer of the scripts and minimal effort is needed to create a new script. Can cache long-running or fault-likely steps to avoid stalls in requests.

Shared Resources Library: A python library that includes a message queue, a shared lock manager, a shared state manager, a shared counter manager, and a shared connection pool manager (ex: database cursors). This allows very small scripts to be written, and communicate and coordinate with other small scripts. Logic should be made minimal, and through the shared resources enterprise level features can be added. This is not meant as a high-performance system, in competition with memcache/reddis or ActiveMQ/RabbitMQ, but is meant to be a solid development tool to rapidly create operation automation scripts without adding more operational overhead, which standalone shared resource software requires. All of these are based on thread-safe dictionaries and lists, keeping with data-primitives for maximum development flexibility and simplicity. Expansive logging and wrappers for executing code on a system are also included in this.

Time Series Library: A python library that wraps Time Series requests for a single node. Default implementation backend is RRD, but that is adjustable. Many-node systems will manage file locations and naming and use this library both locally on nodes, and on regional collectors, as desired.

Dot Inspection Library: A small stand-alone library that allows inspection and manipulation of data-primitives via strings. So strings “var1.var2”, “var1.0”, “var1.-1”, “var1.-1.var2”, would do inspection into a data-primitive using input data from a dictionary with “var1” and “var2” fields defined. Sub-inspections can be done like “var1.(svar1.svar2).var2”, so that inspections can be made dynamic. This allows deep configuration of data manipulation at run-time, and stored in data sources, allowing more to be done in procblock, and freeing up real logic code (Python/whatever) to be simpler and more about interacting with the operating system and other services, so that architectural issues and architectural goals can be a separated implementation process than direct interaction logic.

schemagen: A schema generator for a backend data source (default implementation is MySQL). Schema information and relations are specified in data (default: YAML) and can create or update a database or other data source. Implementation allows primary key access into the data source as well, so after the MySQL database has been generated, or updated, requests can be formatted through schemagen to extract or insert data, allowing both specification and interaction wrapping through schemagen.

Mother Brain: This is the schema definition for REM, and is fairly massive in scope, covering all physical and virtual hardware, their connections, OS platform specifications, software package specifications, service, users, monitoring, SLAs and everything else in the REM Authoritative System Automation system.

Utility Computing Library: A wrapper for dealing with utility computing (default: Amazon’s EC2). This wraps provisioning machine instances, disks, floating IPs and all the other resources that are required to create a cloud-like Utility Computing system.

REM scripts: Like the Utility Computing scripts, there are many scripts for managing databases, file systems, services, deployment, monitoring, user lists, pager rotation and many more topics. These integrate into dropSTAR and the Mother Brain to be run via procblock or normal system methods to perform actions and collect data.

These tools and libraries form the foundation of the REM system. REM is intended to be customized and expanded, and ultimately to allow “plans” for automated system administration to be published and shared, creating an open source environment for operations.


Unique Organizational Glue

February 16, 2010

Glue is the most important part of any organization, and the part which is always unique to every organization.  Understanding this, and knowing how to make and apply glue will make the difference between a smooth running organization and an organization that is constantly firefighting and often working against itself.

In more traditional organizations, glue was process.  When you wanted to create cohesiveness between your employees and departments, you needed to create a process, train the players in performing the process, and then have a combination of rewards and punishments for not following the processes that glue your organization together.

Human processes will never be removed, because we are humans and will always have needs too subtle and changing, and human, to be automated.

That said, many things in today’s organizations can and should be automated, and this is the glue with which I spend most of my time making and thinking about.

The thing about organization glue is that it is always unique to your organization.  You can buy off-the-shelf glue, but you still have to apply it uniquely, and take care of it uniquely, and train people how to use it uniquely, because no other business does exactly what your business does, with the exact people and structure your business has.

So whether you are a Buy-Everything-Microsoft-Makes shop or a Build-Everything-Myself-From-Open-Source shop, you are still configuring everything uniquely, to solve your unique problems.

Herein lies the reasons more expert operations people choose Linux and other *nix environments, because in these environments you are expected to make your own glue.  All the components may come readily available, and many of the glues are already pre-mixed, but you are still expected to figure out where to put it, how to configure it, and probably to write your own custom glue code to connect piece X to piece Y, because they don’t quite line up.

In a pre-packaged environment, much of this has been done for you, many processes have already been worked out, and you are expected to implement them to specification.  Certificate programs are created to align workers with the commercial packages they support to enforce these “best practices”.

The trouble comes when these pre-made glue stamps fail to meet all your organizations needs, and then you must create custom glue.  In an environment that expects or requires custom glue, this has a steep learning curve, but is expected and encouraged.  In an environment where everything is supposed to be planned for you to implement, it is very difficult to add your own processes, and agility is lost at the benefit for having the majority of your solution come out of a box.

Working around these unique elements, while still working with the system to not subvert the benefits you received from purchasing it, create an extremely difficult situation made worse by those who are not capable or do not believe in custom solutions.

The real problem is one of expectations.  As a unique service provider, your business will do some things uniquely.  If you are in an industry where your service is all taken care of by humans, then your internal operations may well be simple enough to use off-the-shelf glue, and it will work well enough.

In the complex and ever changing world of internet software companies, this is not the case, and never has been.  And yet, many people still do not understand that their organization requires custom glue, that their processes will not simply connect together, and that by leveraging the abilities of their senior staff to create custom solutions, that mix in with existing open source and purchased solutions, they can find an optimal balance between buying and building, that they never really had a choice between anyway.

You simply can’t buy it all, because no one sells “Your Business In a Box”.  It’s up to you to build your business, and if you do it well, it will seem like it fits in a box.  If you do it poorly, it will seem like a combination of post-Katrina wasteland and a forest fire.

Either way, it pays to understand that your business has unique goals, and that it will take unique glue to bind your employees and departments together to achieve those goals.