REM Monitoring Public Beta

October 27, 2010

Here is the first package for the public beta.

To install this, you will need a Linux, FreeBSD, or OS X operating system.

You will need Python 2.6 installed (the commands below use the python2.6 binary).
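
If you’re not sure which interpreter is on your path, a quick check (this just prints the version of the python2.6 binary) is:

$ python2.6 -V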

Install the PyYAML, jsonlib, setuptools, and Pygments Python packages.
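
If any of these are missing, they can usually be pulled from PyPI with easy_install (this assumes setuptools is already in place, and uses the packages’ published PyPI names):

$ sudo easy_install PyYAML jsonlib Pygments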

You should have rrdtool and the snmpwalk binary (from Net-SNMP) installed on your system.
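
A quick way to confirm both binaries are on your path (a plain which check, nothing REM-specific):

$ which rrdtool snmpwalk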

To install, unzip the package (sorry about the OS X cruft), cd into the main directory, and you will find:

dropstar                jquery_drop_widget      procblock               rem_monitor             unidist

$ cd unidist ; sudo python2.6 setup.py install

$ cd ../procblock/ ; sudo python2.6 setup.py install

$ cd ../dropstar/ ; sudo python2.6 setup.py install

$ cd ../jquery_drop_widget/ ; sudo python2.6 setup.py install

These are also available from PyPI, if you wish to install them via easy_install.
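
For example, assuming the PyPI package names match the directory names above:

$ sudo easy_install unidist procblock dropstar jquery_drop_widget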

To run the application:

$ cd ../rem_monitor/

$ ./procblock monitor.yaml

Then point your browser to: http://localhost:2001/system_monitor/monitor

Substitute the hostname for localhost, if connecting remotely.

That’s it.  A proper installer for CentOS, Fedora, and Debian, along with usage documentation, will be forthcoming in the next few days.


REM Monitoring Talk at Yahoo this Wednesday

October 26, 2010

I’m giving a talk on the REM Monitoring system at the Large-Scale Production Engineering meet-up at Yahoo this Wednesday the 27th.

It will be the first public talk about the REM Monitoring system, and will mark the official Public Beta.  I hope to keep the beta short, about 3-4 weeks, after which the first full release can be made; from there the monitoring system itself should stabilize and mostly just gain additional monitoring plugins and such.

If you want to take a look at the slides, you can download them here.


“CAP Theory” should have been “PAC Theory”

October 8, 2010

CAP obviously sounds a lot better, as it maps to a real word; that probably got it remembered.

However, I’m guessing the name has also helped keep this concept from being understood.  The problem is that the “P” comes last.

CAP: Consistency, Availability, Partitions.  Consistency == Good.  Availability == Good.  Partitions == Bad.

So we know we want C and A, but we don’t want P.  When we talk about CAP, we want to talk about how we want C and A, and let’s try to get around the P.

Except this is the entire principle behind the “CAP Theory”: Partitions are a real event that can’t be avoided.  Have 100 nodes?  Some will fail, and you will have partitions between them.  Have cables or other media between nodes?  Some of those will fail, and nodes will have partitions between them.

Partitions can’t be avoided.  Have a single node?  It will fail, and you will have a partition between you and the resources it provides.

Perhaps had CAP been called PAC, then Partitions would have been front and center:

Due to Partitions, you must choose to optimize for Consistency or Availability.

The critical thing to understand is that this is not an abstract theory; this is set theory applied to reality.  If you have nodes that can become partitioned (going down, losing connectivity), and this cannot be avoided in reality, then you have to choose whether the remaining nodes operate in a “Maximize for Consistency” or a “Maximize for Availability” mode.

If you choose to Maximize for Consistency, you may need to fail to respond, causing non-Availability in the service, because you cannot guarantee Consistency if you respond in a system with partitions, where not all the data is still guaranteed to be accurate.  Why can it not be guaranteed to be accurate?  Because there is a partition, and it cannot be known what is on the other side of it.  Not being able to guarantee the accuracy of the reported data means it will not be Consistent, so the appropriate response to queries is to fail, so they do not receive inconsistent data.  You have traded Availability, as you are now down, for Consistency.

If you choose to Maximize for Availability, you will gather a quorum of the data, or make a best guess as to the best data, and then return it.  Even with a network partition, requests can still be served, with the best possible data.  But is it always the most accurate data?  No, and this cannot be known, because there is a partition in the system and not all versions of the data are known.  Quorums of nodes exist to try to deal with this, but with the complex ways partitions can occur, they cannot be guaranteed to be accurate.  Perhaps they can be “accurate enough”, and that means, again, that Consistency has been given up for Availability.
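
As an illustration only (this is a toy sketch, not REM code; the Node class and its details are made up for the example), here is roughly how the two choices differ when a read arrives during a partition:

# Toy sketch: one node deciding how to answer a read when it cannot
# reach its peers (a partition).
class Node(object):
    def __init__(self, local_data, prefer_consistency):
        self.local_data = local_data                # whatever this node holds
        self.prefer_consistency = prefer_consistency

    def read(self, key, peers_reachable):
        if peers_reachable:
            # No partition visible: answer normally.
            return self.local_data.get(key)
        if self.prefer_consistency:
            # Maximize for Consistency: refuse to answer rather than
            # risk returning stale data.  Availability is lost.
            raise RuntimeError('unavailable: cannot guarantee consistency')
        # Maximize for Availability: answer with the best local data,
        # which may be stale.  Consistency is lost.
        return self.local_data.get(key)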

Often, giving up Consistency for Availability is a good choice.  For things like message forums, community sites, games, or other systems that deal with non-scarce resources, relaxing the requirement for Consistency is a benefit, because it’s more important that people can use the service, and the data will “catch up” at some point and look pretty consistent.

If you are dealing with scarce resources like money, airplane seat reservations (!), or who will win an election, then Consistency is more important.  There are scarce resources being reserved by each request; being inconsistent in approving requests means the scarce resources will be over-committed, and there will be penalties external to the system to deal with.
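
To make the scarce-resource case concrete, here is another toy sketch (again made up purely for illustration): two replicas that cannot see each other during a partition both approve the last seat, because each only checks its own local count.

def reserve(seats_left):
    # Approve if our local copy says a seat remains.
    if seats_left > 0:
        return True, seats_left - 1
    return False, seats_left

seats_a = 1   # replica A local count
seats_b = 1   # replica B stale copy; the partition hides what A does

ok_a, seats_a = reserve(seats_a)
ok_b, seats_b = reserve(seats_b)
assert ok_a and ok_b   # one seat, two approvals: over-committed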

The reality of working with systems has always had this give and take to it.  It is the nature of things not to be all things to all people; they only are what they are.  The CAP theory is just an explanation that you can’t have everything, and since you can’t, here is a clear definition of the choices you have:  Consistency or Availability.

You don’t get to choose not to have Partitions, and that is why the P:A/C theory matters.