Today’s system and network monitoring primarily consists of collecting counters and usage percentages and alerting if they are too high or low, but more comprehensive analysis can be performed by using the same troubleshooting logic that would be performed if a human encountered a system condition, such as a resource being out of limits (ex. CPU at 90% utilization), and then encapsulating this logic along with the alert condition specification, so that automation can be created following the same procedures that would ideally be done by the human on duty.
By constantly aggregating data across services into new time series, this can be analyzed and re-processed in the same way the original collected data was, to create more in-depth alerting or re-action conditions, or to create even higher levels of insight into the operations of the system.
The key is to create the process exactly as the best human detective would do it, because the processes will need to map directly to your business and organizational goals, so it is important that the processes are created to map directly to those goals, and this is easiest to keep consistent over many updates if it is modeled after an idealized set of values for solving the business goals.
For example, a web server system could be running for a media website, which makes it’s money on advertising. They have access to their advertising revenue and ad hit rates through their provider, and can track the amount of money they are making (a counter data type), up to the last 5 minutes. They want to keep their servers running fast, but their profit margins are not high, so they need to keep their costs minimal. (I’m avoiding an enterprise scale example in order to keep the background situation concise.)
To create a continuous testing system to meet this organizations needs, a series of monitors can be set up to collect data about all relevant data points, such as using an API to collect from the advertising vendor, or scraping their web page if that wasn’t available. Collecting hourly cost (a counter data type) and count of running machine instances (a gauge data type) can be tracked to provide insight into the current costs, to compare against advertising revenues.
In addition to tracking financial information: system information, such as the number of milliseconds it takes their web servers to deliver a request (a gauge data type), can be stored. They have historical data that says that they get more readers requesting pages and make more money when their servers respond to requests faster, so having the lowest response times on the servers is a primary goal.
However, more servers cost more, and advertising rates fluctuate constantly. At times ads are selling for higher, at times less, and some times only default ads are shown, which pay nothing. During these periods the company can lose money by keeping enough machine instances running to keep their web servers responding faster than 200ms, at all times.
Doing another level of analysis on the time series data for the incoming ad revenue, the costs of the current running instances and the current web server response times, an algorithm can be developed to maximize the best response for maximizing revenues during both periods of high advertising revenues and periods of low advertising revenue. This can change the request times needed to create a new running instance to perhaps only 80% of request tests have to be under 200ms, instead of 100% of tests (out of the last 20 tests, at 5 second intervals), and at lower revenue returns raise the threshold to 400ms responses. This value for tolerances and trends could also be saved in a time series to compare against trended data on user sign-up and unsubscribes.
If the slow responses are causing user sign-ups to decrease in a way that impacts their long term goals, that could be factored into the cost allowed to keep response times low. This could be balanced against the population in targeted regions in the world, where they make the most of their revenues from, so they keep a sliding scaling between 200ms and 400ms depending on the percentage of their target population that is currently using their website, weighted along with the ad revenues.
This same method can work for deeper analysis of operating failures, such as the best machine to be selected as the next database replication master, based on historical data about it being under load and keeping up with it’s updates and it’s network latency to it’s neighbor nodes. Doing this extra analysis could avoid selecting a new master that has connectivity problems to several other nodes, creating less stable solution than if the network problems had been taken into account.
By looking at monitoring, graphing, alerting and automation as a set of continuous tests against the business and operational goals your organization’s automation can change from a passive alerting system into a more active business tool.