remlite: The Light and Fluffy Red Eye Monitor

January 26, 2010

The full install of Red Eye Monitor (REM) is a major project and will take a few more weeks to complete.  To get a jump on doing cloud management centrally, I am pulling a lot of pieces out of REM and creating remlite.  This will be a much smaller implementation of a cloud manager: it is not backed by a relational database, and it does not attempt Total Systems Automation.

Here’s the design overview for remlite:

http://redeyemon.sourceforge.net/remlite/


Why operational systems and programs require testing, debugging and refactoring.

January 24, 2010

An operational environment, such as a web site and web applications with their corresponding data and infrastructure, has many similarities to an executable program.  Each has a collection of data, logic and resource requirements that it needs to perform a function.  Through a series of instructions, both process their execution requests against their data and run-time environments.

When an error occurs, an exception is thrown; sometimes it is thrown loudly, and sometimes it is quietly caught and ignored.  Many classes of errors can’t be found until the program is run, because the compiler cannot test for many kinds of logic problems or for issues with the run-time environment.  Parts of the program that are interpreted at run-time may also have problems that cannot be tested before they are run, due to the flexibility of interpretation at run-time.

Similarly, when initially configuring an operational system, some errors can be found on setting up the services and storage, when making sure all the pieces connect together.  Like running a program and not seeing any errors on completion, the operational system can be tested for functionality while it is first put together.

However, those many classes of errors that can’t be found by the compiler may still exist in the areas of the code that have not yet been run, and as a general rule, they are always present.
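A minimal Python sketch of this point, using hypothetical names (`handle_request`, `try_request` and the "refund" branch are illustrations, not part of REM): the common path runs cleanly every time, while a branch that has never been executed hides an error that only appears when it is finally run.

```python
def handle_request(kind, amount):
    """Hypothetical handler with one well-exercised path and one untested path."""
    if kind == "purchase":
        # Exercised on every run, so problems here surface early.
        return {"status": "ok", "charged": amount}
    if kind == "refund":
        # Never exercised: `ammount` is a typo, but nothing reports it
        # until this branch actually runs.
        return {"status": "ok", "refunded": ammount}
    raise ValueError(f"unknown request kind: {kind}")


def try_request(kind, amount):
    """Run a request and report a run-time failure instead of crashing."""
    try:
        return handle_request(kind, amount)
    except NameError as exc:
        return {"status": "error", "detail": str(exc)}


print(try_request("purchase", 10))  # the tested path works
print(try_request("refund", 10))    # the untested path fails only now
```

The same shape applies to an operational system: a backup restore or failover path that has never been run can carry an equivalent mistake in its configuration.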

In the areas of a program or a system that have not been tested, there will be problems that halt the operation, and it may or may not be immediately obvious what is wrong.  Additionally, the fix for the problem may require a number of changes to the program or system.

In the case of a program, this is expected.  After writing a program, it is known to require a testing and debugging phase, and lately it has also become popular to refactor code so that the design stays robust and healthy, and does not become more entwined with coupled logic which becomes harder to change as it grows in scope and size.

Due to the similarities between a program and an operational system, operational systems also require testing, debugging and refactoring before they should be deployed from development to production.  The stages of development, testing, debugging and refactoring have become common sense in the development community, but are not always seen as important in the operational community.

An excellent method in both communities to quickly achieve a result is to build a rapid prototype.  This creates a functioning program or system that can be proved to function, and it provides enormous insight into the requirements for integrating all the components together to create that program or system.

Additionally, rapid prototyping provides a working blueprint for how to transition to a production quality program or system.  By having a functioning prototype, all the pieces can be seen functioning together.  For each of the challenges of making the program or system work reliably, there will be experience to back up the initial plans and theories.

Once the rapid prototype is available, translating each of the visible requirements can be done rapidly, as new plans can be drawn up after reviewing the areas where the prototype’s design worked strongest and weakest.

Shortly thereafter, a tested, debugged and refactored program or system can be released with confidence that supporting its usage over time will be manageable because of the valuable insight the prototype provided and the robust improvements added by the production release.

After all, unlike a program which is simply stopped and started, an operational system is running 24/7.  Upgrading an operational system while it is carrying traffic is significantly harder and slower, and so significantly more costly, than changing it before it has started to carry traffic in perpetuity.

A short canary test of the prototype, to determine how well it functions under user traffic, can provide another level of insight, and can take place in parallel with preparing the production release.
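One common way to run such a canary is to send a small, stable fraction of users to the prototype.  The sketch below is only an illustration under that assumption; the function name `routes_to_canary`, the user ids and the 5 percent figure are hypothetical and not taken from any REM design.

```python
import hashlib


def routes_to_canary(user_id: str, percent: int) -> bool:
    """Return True if this user should be served by the canary (prototype).

    The user id is hashed into one of 100 buckets, so the same user always
    lands on the same side and roughly `percent` of users reach the canary.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Count how many of 10,000 made-up users would reach the canary at 5 percent.
users = [f"user-{n}" for n in range(10000)]
canary_users = sum(routes_to_canary(u, 5) for u in users)
print(f"{canary_users} of {len(users)} users routed to the canary")
```

Hashing the user id, rather than choosing randomly per request, keeps each user's experience consistent for the length of the test.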


Red Eye Monitor (REM)

January 19, 2010

Red Eye Monitor (REM) is a project I am developing which is a Total System Automation framework and set of scripts for managing cloud vendor and data center machines and storage in an integrated and fully automated fashion.

The site resides on SourceForge, here:

http://redeyemon.sourceforge.net/

Because SourceForge doesn’t have any kind of blogging mechanism, I’m using WordPress to keep my custom tool sets for this project to a minimum.

At the moment I’m finishing documentation and the addition of the final SQL tables that add the data center and advanced service and package concepts.  REM had already been successfully tested on a functioning system, but I have since added automated persistence for databases and storage, and wanted to expand it to include a “home cloud”, where all non-cloud machines, whether virtualized or raw hardware, are treated the same way that a machine instance from a cloud vendor, like Amazon’s EC2, is treated.

This design is now present and documented at the URL listed above, and the schema has been designed.  My next steps, after I finish writing the basic documentation, are to merge the new schema in and wrap it up in the API code and basic web pages.

In parallel with this, I’m building a cloud-only EC2 install to start building a full matrix of possible failures for this system, so that I can write tests to trigger each machine/system failure, and ensure that REM covers the system properly.  Fail tests would include writing bad data into configuration files, doing the same and restarting the services, writing random data over storage volumes or critical kernel modules, deleting database tables, filling up the partition, changing permissions on log files, and all other unique failure conditions that can occur.
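A failure matrix of this kind can be sketched as a table that maps each failure to the code that injects it and the problem a check is expected to report.  The Python below is only an illustration: the file names, the `health_problems` check and the three injectors are hypothetical stand-ins, not REM code, and everything runs inside a throwaway temporary directory (POSIX-style permissions are assumed) rather than against a real machine.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_sandbox(root: Path) -> None:
    """Create a tiny stand-in for a healthy service installation."""
    (root / "service.conf").write_text("port=8080\n")
    (root / "service.log").write_text("")
    (root / "table.db").write_text("rows")


def corrupt_config(root: Path) -> None:
    """Write bad data into the configuration file."""
    (root / "service.conf").write_text("\x00not a valid config\x00")


def revoke_log_permissions(root: Path) -> None:
    """Change permissions on the log file so it cannot be written."""
    os.chmod(root / "service.log", 0o000)


def delete_table(root: Path) -> None:
    """Remove the data file, standing in for a deleted database table."""
    (root / "table.db").unlink()


def health_problems(root: Path) -> list:
    """Hypothetical check: return the names of the problems it detects."""
    problems = []
    if not (root / "service.conf").read_text().startswith("port="):
        problems.append("config")
    log_mode = stat.S_IMODE((root / "service.log").stat().st_mode)
    if not log_mode & stat.S_IWUSR:
        problems.append("log")
    if not (root / "table.db").exists():
        problems.append("data")
    return problems


# Failure name -> (injector, problem the check is expected to report).
FAILURE_MATRIX = {
    "bad config data": (corrupt_config, "config"),
    "log permissions changed": (revoke_log_permissions, "log"),
    "database table deleted": (delete_table, "data"),
}


def run_matrix() -> dict:
    """Inject each failure into a fresh sandbox and record whether it was detected."""
    results = {}
    for name, (inject, expected) in FAILURE_MATRIX.items():
        with tempfile.TemporaryDirectory() as tmp:
            root = Path(tmp)
            make_sandbox(root)
            inject(root)
            results[name] = expected in health_problems(root)
    return results


for failure, detected in run_matrix().items():
    print(f"{failure}: {'detected' if detected else 'MISSED'}")
```

Each new failure condition becomes one more row in the table, and any row reported as missed points at a case the monitoring does not yet cover.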

Once this matrix has been designed, I will put the REM installation through its paces and find which areas are already functioning resiliently and which are broken, and start writing the code to handle the broken cases.

I hope to get this in hand over the next couple of weeks, but for the next day or two I’ll be finishing up the basic documentation, which will also serve as the living design document.

If anyone is interested in this project, I’m not looking for coding support at the moment, as too many design aspects are still in swing.  I could definitely use experienced opinions on anything that catches your eye as a logic flaw, gap in the design, or missing failure case.

This project will be considered Beta when I have a comprehensive set of documentation, an ISO and an EC2 AMI image that will kickstart a REM installation, and the failure matrix well-populated and fully tested with the initial REM packages.

The REM installation will consist of an Apache web server pool, with a MySQL backend, a Postfix mail server, an NFS server to share static content between the Apache servers, and a syslog server.  The basics of a functioning internet presence.

When this installation proves resilient under the failure tests, then it is a matter of getting enough positive feedback to leave Beta.