Another Demo: No Relay Chat

August 25, 2010

As another demonstration of how the unidist library can be put to use, primarily using shared state and message queues, I wrote an IRC-like web-based chat server named No Relay Chat.

You can download the No Relay Chat demo here, at the procblock package’s Google Code page.  You’ll also need the latest procblock, which is on that page.

Below is a screen shot, but here is an online demo. Log in and make some channels.

There is an intermittent bug with the New Channel pop-up dialog; close it and try again, or reload the page if you’re still interested at that point.  Closing a channel also intermittently seems to not unsubscribe; I may be missing a JS error in some case, or there could be a race condition setting channels again after the unsubscribe.  The communications layer has always been solid and reliable, so I’m guessing it’s something like that.  I’ll clean these up over the next few days.

At a later date I’ll wrap this up into a full page with IM, ops, moderation, invite/secret channels, and other IRC goodness so it is more useful, and then I’ll package it up for easy installation via RPM/MSI/etc.


Local System Monitoring Demo

August 22, 2010

I have the first draft of the local system monitoring demo (single node) ready; it can be viewed here.

I’ll be fleshing this out more after I finish the monitoring for Linux, fix Disk I/O updating properly on FreeBSD and OS X, and fix the View Internals for RRDs that have multiple targets per type.  Then I’ll add some formatting for the sections, make the list of items dynamic so you can turn uninteresting ones off, and ship that demo.  After a few more demos to finish testing all the different packages that make up Red Eye Monitor (REM), I will turn this into a real monitoring software install that does good things out of the box and works on single nodes or multiple nodes.


dropSTAR released as Python library

August 19, 2010

dropSTAR has now been released as a stand-alone Python library for creating an HTTP server.

I’ll be putting together RPM/Deb/make packages to provide a more functional install for those interested in the functionality rather than the library packages.  These will come with installers for modules on the dropSTAR and procblock platform, which will allow services to be packaged and downloaded separately, and will stay focused on providing functionality, not a lower-level development framework.

More to come!


dropSTAR (webserver) procblock demo

August 18, 2010

Now for a demo with a bit more teeth.  This will soon be released as a stand-alone open source web server package named dropSTAR (Scripts, Templates And RPC).  It is designed to make it easy to drop in new dynamic pages, and it is focused on system automation rather than the standard end-user applications that normal dynamic web servers are intended to serve.  It can do that kind of work too, but it’s not optimized for millions of page loads; it’s optimized to take as little time as possible to create and modify web pages for system administration use, and to provide RPC for dynamic pages or system automation between nodes or processes.

The demo can be downloaded at the procblock downloads page.  You will need to install procblock as well, which is also available at that link.

A live version of the demo can be played with here.

The demo is a series of tabs, each doing something different:

CPU Monitoring

This tab shows a very simple CPU statistics collection program (it runs a shell command, parses the output, and returns a dict of fields), which runs in an Interval RunThread every 5 seconds and then graphs the results.  The page automatically reloads the graph every 5 seconds so it can be watched interactively.
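For reference, here is a minimal sketch of what such a collector could look like, using the same ProcessBlock() entry point shown in the “Simplest procblock Demo” below.  The command, parsing and field names are illustrative, not the demo’s actual code:

import subprocess

def ProcessBlock(pipe_data, block, request_state, input_data, tag=None,
                 cwd=None, env=None, block_parent=None):
  """Illustrative CPU collector: run a shell command, return a dict of fields."""
  # 'uptime' prints the 1/5/15 minute load averages on Linux, OS X and FreeBSD.
  proc = subprocess.Popen(['uptime'], stdout=subprocess.PIPE,
                          universal_newlines=True)
  output = proc.communicate()[0]
  # Everything after the last colon is the three load average numbers.
  loads = output.rsplit(':', 1)[1].replace(',', ' ').split()[:3]
  return {'load_1': float(loads[0]),
          'load_5': float(loads[1]),
          'load_15': float(loads[2])}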

System Information

This is another very simple shell command: it cats /proc/cpuinfo and puts the columns into an HTML table.
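A rough sketch of the same idea in Python (reading the file directly instead of shelling out; the markup is illustrative):

def CpuInfoToHtmlTable(path='/proc/cpuinfo'):
  """Render the key: value lines of /proc/cpuinfo as a two-column HTML table."""
  rows = []
  for line in open(path):
    if ':' not in line:
      continue  # Skip the blank lines between processor sections
    key, value = [part.strip() for part in line.split(':', 1)]
    rows.append('<tr><td>%s</td><td>%s</td></tr>' % (key, value))
  return '<table>%s</table>' % ''.join(rows)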

Logs

This tab reads the tail of a log file, reverses the lines, and splits the contents to format them in color for HTML.  It updates every 5 seconds.
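Conceptually it does something like the following (the line count and color rule here are made up for the sketch):

def TailLogAsHtml(path, lines=50):
  """Return the last `lines` log lines, newest first, colored for HTML."""
  tail = open(path).readlines()[-lines:]
  tail.reverse()  # Newest entries at the top
  html = []
  for line in tail:
    # Hypothetical rule: highlight anything that looks like an error.
    color = 'red' if 'ERROR' in line else 'black'
    html.append('<span style="color: %s">%s</span>' % (color, line.rstrip()))
  return '<br/>\n'.join(html)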

Requests

This is the most complex tab on the page.  It has a monitor for “requests”, which is a counter in the shared resources (unidist.sharedcounter module); a thread runs with a delay and increments the requests.  The total number of requests since starting is shown as text, and the graph displays the change in this request variable over time.

A slider allows adjustment of the delay for requests, which is saved in shared state (unidist.sharedstate module).  Reloading the page will keep the slider in any changed position, and the graph/counter should correlate to the position of the slider in terms of more or fewer requests per second.

There is also a button called “Stop Request Poller” which acquires a lock (unidist.sharedlock module), stopping the poller from incrementing the request counter.  If toggled again, requests will resume.
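To make the moving parts concrete, here is a plain-Python stand-in for the poller loop.  It substitutes ordinary threading primitives for the unidist modules named above (whose actual APIs are not reproduced here), but the behavior is the same: a counter that a background thread increments, a delay value the slider would overwrite, and a lock that pauses incrementing while it is held.

import threading
import time

request_count = 0              # Stand-in for the unidist.sharedcounter counter
request_delay = 1.0            # Seconds; stand-in for the slider's shared state value
pause_lock = threading.Lock()  # Stand-in for the "Stop Request Poller" shared lock

def RequestPoller():
  """Increment the request counter every request_delay seconds, unless paused."""
  global request_count
  while True:
    if pause_lock.acquire(False):  # Non-blocking: skip this tick if the button holds the lock
      request_count += 1
      pause_lock.release()
    time.sleep(request_delay)

poller = threading.Thread(target=RequestPoller)
poller.daemon = True
poller.start()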

The bottom right side of the page has not been completed yet, so it is just there to look pretty and take up page space.  Later this will turn into an adjustable SLA monitor which will notify or alert (via the HTML page) when the SLA is near or out of tolerance with regard to requests per second.

Wall

This page shows the use of the message queues (unidist.messagequeue module), which allow messages to be inserted into a queue for later processing.  Any message typed into the input field, followed by the Enter key or a Write button click, will be inserted into the message queue “wall” in the shared message queues.  Any messages older than the last 25 are discarded, to keep it only storing useful data.  Messages are not removed from the queue on reading, so that they can be continually re-processed for display.

Then, every 5 seconds, an RPC call updates the page with all the messages.
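As a conceptual stand-in for the shared “wall” queue (the unidist.messagequeue API itself is not reproduced here), a capped deque shows the two properties described above: only the last 25 messages are kept, and reads do not consume messages.

from collections import deque

wall = deque(maxlen=25)  # Messages older than the last 25 fall off automatically

def WriteMessage(text):
  """Append a chat message to the wall queue."""
  wall.append(text)

def ReadMessages():
  """Return all current messages without removing them, so the page can
  re-render the wall on every poll."""
  return list(wall)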


Simplest procblock Demo

August 18, 2010

This is the simplest demo I could think of.  It features one script that is run by procblock, specified by a YAML file, simplest.yaml:

run:
 - script: simplest.py

Which runs simplest.py:

import random

def ProcessBlock(pipe_data, block, request_state, input_data, tag=None, cwd=None, env=None, block_parent=None):
  """Simplest demo possible."""
  data = {'random': random.randint(0, 100)}
  return data

To invoke procblock, run:

cd demo_simplest
./procblock simplest.yaml

This will invoke the script “simplest.py”, which has a standard module function, ProcessBlock().  This is the standardized method for all code process blocks, and it allows them to chain together, passing relevant information and shared data between them.  In this case there is only one script, so it simply returns its result.
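As a rough illustration of the chaining idea, a second script’s ProcessBlock() could build on what an earlier block produced.  This assumes pipe_data carries the accumulated results of earlier blocks, which is my reading of the interface rather than something shown in this demo:

def ProcessBlock(pipe_data, block, request_state, input_data, tag=None,
                 cwd=None, env=None, block_parent=None):
  """Hypothetical second block in a chain: build on the previous result."""
  # Assumption: pipe_data holds earlier blocks' output, e.g. {'random': 42}.
  previous = pipe_data.get('random', 0) if pipe_data else 0
  return {'random_doubled': previous * 2}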

Example:

monkey:demo_simplest ghowland$ ./procblock simplest.yaml 2> /dev/null
{'run': {'__duration': 0.010447025299072266,
 '__start_time': 1282114491.079386,
 'args': [],
 'random': 99}}
monkey:demo_simplest ghowland$

I redirected STDERR to /dev/null because I am leaving the STDERR logging on until procblock is ready to leave Beta Testing.

For a second example, there is a slightly more complicated procblock, simplest_monitor.yaml.

As the name implies, this is the simplest monitor I could think of.  It monitors the random numbers simplest.py generates every 5 seconds, stores them in an RRD file, and then graphs them.

Because it runs the script every 5 seconds, it is labeled a “long running” process, so it will continue to run until CTRL-C is pressed.  That sends a notification for all threads to exit gracefully, which they will do if they are properly written.

Here is the contents of simplest_monitor.yaml:

run:
 - script: simplest.py
   cache: 5
   thread_id: simplest
   timeseries collect:
     path: simplest.rrd
     interval: 5

     fields:
       random:
         type: GAUGE

     graph:
       - path: simplest.png
         title: Simplest Monitoring Demo
         fields: [random]
         method: STACK
         interval: 10
         vertical label: "Random #s"

__usage:
  name: simplest
  author: Geoff Howland

  # Let this run, so we can monitor it
  longrunning: true

The big additions here are the “timeseries collect” section, which defines which fields to collect from the run script and how to graph them, and the __usage section, which defines the name of the block (simplest) and the author, and sets longrunning to true so that the script won’t quit as soon as the thread is created to monitor simplest.py’s results.

Example 2:

Run:

monkey:demo_simplest ghowland$ ./procblock simplest_monitor.yaml 2> /dev/null
Running Thread: Starting: simplest
Waiting for interval thread output: simplest
{'run': {'__duration': 0.41039705276489258,
 '__start_time': 1282114743.028677,
 'random': 6,
 'run_thread.simplest': simplest: Is Running: True  Scripts: ['simplest.py']}}
^CRunning Thread: Quitting: simplest
monkey:demo_simplest ghowland$

The result, after a bit of waiting:


What happens this time is a bit different.  Right away we get a returned object that shows a quite short duration, a single random number, and a field called “run_thread.simplest”; “simplest” is the thread_id I gave to this monitor thread.  Its value is the __repr__() string representation of a RunThread object, and you can see it is running (Is Running: True) and which scripts it runs (['simplest.py']).  It is also formatted in HTML, because that is where I am using it right now while testing the web server internals.  Another beta artifact.

Nothing else appears to happen in this script until I press CTRL-C and it quits; that is because I have the logging turned off.  With the logging on, it shows what is going on:

monkey:demo_simplest ghowland$ ./procblock simplest_monitor.yaml
DEBUG:20100818000408:procyaml.py:138:ImportYaml: Importing YAML: simplest_monitor.yaml
DEBUG:20100818000408:mainfunctions.py:434:ProcessAndLoop: Long Running Process: Starting...  (CWD: /Users/ghowland/blocks/demo_simplest)
DEBUG:20100818000408:procyaml.py:138:ImportYaml: Importing YAML: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/procblock-201008.1-py2.6.egg/procblock/data/default_tag_functions.yaml
DEBUG:20100818000408:procyaml.py:138:ImportYaml: Importing YAML: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/procblock-201008.1-py2.6.egg/procblock/data/default_condition_functions.yaml
DEBUG:20100818000408:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115048.28: {'random': 67}
DEBUG:20100818000408:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
{'run': {'__duration': 0.4735870361328125,
 '__start_time': 1282115048.26404,
 'random': 67,
 'run_thread.simplest': simplest: Is Running: True  Scripts: ['simplest.py']}}
DEBUG:20100818000413:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115053.58: {'random': 93}
DEBUG:20100818000413:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000418:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115058.84: {'random': 85}
DEBUG:20100818000418:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000424:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115064.14: {'random': 57}
DEBUG:20100818000424:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000429:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115069.48: {'random': 79}
DEBUG:20100818000429:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
DEBUG:20100818000434:rrd.py:48:StoreInRrd: Storing RRD Occurrance: simplest.rrd: 1282115074.81: {'random': 32}
DEBUG:20100818000434:rrd.py:183:GraphRrd: Graphing RRD: simplest.rrd
^CDEBUG:20100818000435:mainfunctions.py:451:ProcessAndLoop: ProcessAndLoop: Keyboard Interrupt: Releasing lock: __running
DEBUG:20100818000435:mainfunctions.py:470:ProcessAndLoop: Quitting...
Running Thread: Quitting: simplest
monkey:demo_simplest ghowland$

The first 4 lines show procblock starting up; all YAML files that are loaded are logged, so you can trace what it is doing.  Then the first run of simplest.py is made and returned in the “run” tag result, along with the RunThread object, which is still running in a thread.

Then, every 5 seconds, simplest.py’s ProcessBlock() is invoked, the result (a dict like {'random': 67}) is stored in simplest.rrd, and simplest.png is re-graphed.

Then I hit CTRL-C, which caused a shared lock called __running to be released; the thread that was running simplest.py every 5 seconds quit the next time it was invoked, releasing the process, and procblock finished.


Launching in progress…

August 17, 2010

I’ve started launching Red Eye Mon, but it’s going to be a long road. Currently I have packaged up two of the core components:

  • procblock: a logic and data tag processor and hardened execution environment
  • unidist: Unified Distributed Computing for Python: Message Queues, Locks, Counters, State, Logs and Time Series

unidist is the base library, providing a set of distributed computing mechanisms: locks, message queues, counters, state and logging.  This allows programs to share information easily, in several different ways that make using the shared data easy and reliable.

procblock is the workhorse of the system.  It is a method to abstract the architecture of a system of scripts, so that the scripts themselves remain simple, yet can be combined to do complete things.

procblock is a hardened execution environment, which handles thread and process management in a variety of ways.  procblock is also a conditional tag processor, which can conditionally return data.

Mixing the tag processor functionality with the hardened execution environment functionality delivers some interesting results.

A demo of a web server (dropSTAR) running via procblock, using the unidist tools (message queues, locks and counters) and doing graphing, with code and data exposed, can be found here.

Next, I will post documentation for the unidist library, add unit tests, and start to release demos that can be downloaded, with examples and instructions on how to use them.

One of these demos will be the dropSTAR (scripts, templates and RPC) webserver, which runs via procblock, is configured with procblock, and whose pages are rendered via procblock.  dropSTAR is another important component of the Red Eye Mon (REM) system, as it provides insight into, and communication between, nodes.

After procblock, unidist and dropSTAR releases are complete, I will release schemagen and the Mother Brain schema for Red Eye Mon, which will be the control database for managing automated systems, and then I will start to release all the ported sets of scripts for system management, deployment, monitoring and cloud management.


Continuous Testing

August 4, 2010

Today’s system and network monitoring primarily consists of collecting counters and usage percentages and alerting if they are too high or low.  More comprehensive analysis can be performed by using the same troubleshooting logic a human would apply when encountering a system condition, such as a resource being out of limits (e.g. CPU at 90% utilization), and encapsulating that logic along with the alert condition specification, so that automation can be created that follows the same procedures the human on duty would ideally perform.

By constantly aggregating data across services into new time series, the results can be analyzed and re-processed in the same way the originally collected data was, to create more in-depth alerting or reaction conditions, or to create even higher levels of insight into the operation of the system.

The key is to create the process exactly as the best human detective would do it.  The processes need to map directly to your business and organizational goals, and this is easiest to keep consistent over many updates if the processes are modeled after an idealized set of values for solving the business goals.

For example, consider a web server system running for a media website that makes its money on advertising. They have access to their advertising revenue and ad hit rates through their provider, and can track the amount of money they are making (a counter data type) up to the last 5 minutes. They want to keep their servers running fast, but their profit margins are not high, so they need to keep their costs minimal. (I’m avoiding an enterprise-scale example in order to keep the background situation concise.)

To create a continuous testing system to meet this organization’s needs, a series of monitors can be set up to collect all relevant data points, such as using an API to collect from the advertising vendor, or scraping their web page if an API isn’t available. Hourly cost (a counter data type) and the count of running machine instances (a gauge data type) can be tracked to provide insight into current costs, to compare against advertising revenues.

In addition to financial information, system information, such as the number of milliseconds it takes their web servers to deliver a request (a gauge data type), can be stored. They have historical data showing that they get more readers requesting pages and make more money when their servers respond to requests faster, so having the lowest response times on the servers is a primary goal.

However, more servers cost more, and advertising rates fluctuate constantly. At times ads sell for more, at times for less, and sometimes only default ads are shown, which pay nothing. During these periods the company can lose money by keeping enough machine instances running to keep their web servers responding faster than 200ms at all times.

By doing another level of analysis on the time series data for incoming ad revenue, the costs of the currently running instances and the current web server response times, an algorithm can be developed to maximize revenue during both periods of high advertising revenue and periods of low advertising revenue. This can relax the response-time requirement for creating a new running instance: perhaps only 80% of request tests have to be under 200ms, instead of 100% of tests (out of the last 20 tests, at 5 second intervals), and at lower revenue returns the threshold could be raised to 400ms responses. These tolerance values and trends could also be saved in a time series, to compare against trended data on user sign-ups and unsubscribes.
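A sketch of that kind of policy, with purely illustrative revenue tiers and thresholds:

def ResponsePolicy(ad_revenue_per_hour):
  """Pick the response-time SLA based on current ad revenue (illustrative tiers)."""
  if ad_revenue_per_hour > 100.0:    # High revenue: every test must pass 200ms
    return {'threshold_ms': 200, 'required_pass_ratio': 1.0}
  elif ad_revenue_per_hour > 25.0:   # Moderate revenue: 80% under 200ms is enough
    return {'threshold_ms': 200, 'required_pass_ratio': 0.8}
  else:                              # Low revenue: relax the threshold to 400ms
    return {'threshold_ms': 400, 'required_pass_ratio': 0.8}

def NeedsAnotherInstance(response_times_ms, policy):
  """True if too few of the last 20 five-second tests met the current threshold."""
  passed = [t for t in response_times_ms if t <= policy['threshold_ms']]
  return len(passed) < policy['required_pass_ratio'] * len(response_times_ms)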

If the slow responses are causing user sign-ups to decrease in a way that impacts their long-term goals, that could be factored into the cost allowed to keep response times low. This could be balanced against the population of the targeted regions of the world they make most of their revenue from, keeping a sliding scale between 200ms and 400ms depending on the percentage of their target population currently using the website, weighted along with the ad revenues.

This same method can work for deeper analysis of operating failures, such as selecting the best machine to be the next database replication master, based on historical data about how it behaves under load, how well it keeps up with its updates, and its network latency to its neighbor nodes. Doing this extra analysis could avoid selecting a new master that has connectivity problems to several other nodes, which would create a less stable solution than if the network problems had been taken into account.

By looking at monitoring, graphing, alerting and automation as a set of continuous tests against the business and operational goals, your organization’s automation can change from a passive alerting system into a more active business tool.