As part of our ongoing investment in our hosting and support network, we’ve just completed the installation and testing of a completely new monitoring solution for all of our clients. There are a number of ways to monitor the health of a website. We wanted to ensure that our customers get the best possible feedback and that, in the event of a problem, we can learn as much as possible from it, so that future issues can be mitigated.
There are essentially two components to the monitoring solution: internal network testing and monitoring, and external uptime monitoring. We use a powerful set of tools to accomplish this important and time-sensitive task.
Externally, we monitor every site, every server box, and the network itself via two independent sources. This is important, because it is possible for connectivity to be interrupted by issues external to our data center. By using two independent, geographically separated sources, we reduce the chances of an external interruption creating confusion for either our engineers or our clients.
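To make the idea concrete, here is a minimal sketch of how results from two independent external checks might be combined. The function name, the labels it returns, and the decision rule itself are our own illustrative assumptions, not a description of the actual tooling.

```python
def classify_site(probe_a_ok: bool, probe_b_ok: bool) -> str:
    """Combine two independent external checks into one verdict.

    Hypothetical rule (an assumption for illustration): a site is only
    treated as down when both sources agree; a disagreement suggests a
    problem on the path between one source and the data center.
    """
    if probe_a_ok and probe_b_ok:
        return "up"
    if not probe_a_ok and not probe_b_ok:
        return "down"
    return "possible external issue"


if __name__ == "__main__":
    # One source cannot reach the site, the other can.
    print(classify_site(True, False))
```

With a single source, the disagreement case would simply look like an outage, which is the confusion the second source helps avoid.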
The real power comes from our new internal hardware/software solution. Allow me to put on a techie cap for a few moments…
R/com uses a hierarchical data tree of ‘entities’ to store monitoring data. At the top of the tree is the Customer, a software container for all monitoring data related to a single organization or installation. Beneath the Customer are the various individual Sites, which contain Devices (read that as hardware). Each Device is a single logical or physical node on your network that Lithium is monitoring (for example, a Server, Switch, Router, or Storage Array). Within each Device is a hierarchy of Containers, Objects, Metrics and Triggers. Containers hold groups of Objects of the same type, such as Network Interfaces or CPUs, and each Object is a unique item of that type. A Metric is a polled or calculated value that relates to the operation of that Object (e.g., Percent Used, Input Packets Per Second, or Temperature). Triggers define the conditions under which an Incident, or fault condition, should be raised for that Metric.
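For the more code-minded, the entity tree described above can be pictured roughly like this. It is only a sketch: every class and field name is a hypothetical stand-in that mirrors the terms in the text, and it is not the product's real data model or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# All names below are hypothetical, chosen to mirror the entity names
# in the text (Customer > Site > Device > Container > Object > Metric).


@dataclass
class Trigger:
    condition: str    # e.g. "Warning" or "Critical"
    threshold: float  # value at which the condition is raised


@dataclass
class Metric:
    name: str
    triggers: List[Trigger] = field(default_factory=list)


@dataclass
class MonitoredObject:
    name: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Container:
    name: str
    objects: List[MonitoredObject] = field(default_factory=list)


@dataclass
class Device:
    name: str
    containers: List[Container] = field(default_factory=list)


@dataclass
class Site:
    name: str
    devices: List[Device] = field(default_factory=list)


@dataclass
class Customer:
    name: str
    sites: List[Site] = field(default_factory=list)


def metric_paths(customer: Customer) -> Iterator[str]:
    """Walk the tree and yield one slash-separated path per Metric."""
    for site in customer.sites:
        for device in site.devices:
            for container in device.containers:
                for obj in container.objects:
                    for metric in obj.metrics:
                        yield "/".join([customer.name, site.name, device.name,
                                        container.name, obj.name, metric.name])


# A made-up example tree: one customer, one site, one server, one disk.
example = Customer("acme", [Site("main", [Device("web01", [
    Container("storage", [MonitoredObject("disk0", [
        Metric("percent_used", [Trigger("Warning", 80.0)])])])])])])

if __name__ == "__main__":
    for path in metric_paths(example):
        print(path)
```

Walking the tree from the Customer down is what lets a single Metric, such as how full one disk is, be traced back to the server, site and organization it belongs to.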
I know it’s a bit over the top in terms of techno-speak, but essentially, what we’re doing is monitoring every element within each server and within each website to determine whether anything is about to go wrong (a warning) or has already gone wrong (a notification). An example might be a hard drive. Our new system will let us know if a drive is getting too full but is not yet at capacity. We can then share this information with our customer, hopefully resulting in their approval to make adjustments and prevent a failure or problem.
Each monitored server and related component has an Operational Status of Normal (Green), Warning (Yellow), Impaired (Orange) or Critical (Red). The operational state of an Entity is controlled by the Triggers that are applied to the Metrics being collected or calculated for the device. Okay, say that quickly ten times. If there’s a problem, we’ll find out! We can also adjust the Trigger thresholds so that each server, and in turn each customer, can determine the “line in the sand” that we need to be aware of.
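Here is a small sketch of how Trigger thresholds could map a Metric value onto the four states, and how a device might take on the worst state among its Metrics. The threshold numbers, the function names and the "worst state wins" rule are assumptions we made up for illustration.

```python
from enum import IntEnum
from typing import Dict, Iterable


class Status(IntEnum):
    # Ordered so that a higher number means a more serious state.
    NORMAL = 0    # Green
    WARNING = 1   # Yellow
    IMPAIRED = 2  # Orange
    CRITICAL = 3  # Red


# Hypothetical "line in the sand" values for a disk's Percent Used metric.
DISK_THRESHOLDS: Dict[Status, float] = {
    Status.WARNING: 80.0,
    Status.IMPAIRED: 90.0,
    Status.CRITICAL: 97.0,
}


def metric_status(value: float, thresholds: Dict[Status, float]) -> Status:
    """Return the most serious status whose threshold the value has reached."""
    status = Status.NORMAL
    for level, limit in thresholds.items():
        if value >= limit and level > status:
            status = level
    return status


def device_status(metric_statuses: Iterable[Status]) -> Status:
    """Assumed roll-up rule: a device is as unhealthy as its worst metric."""
    return max(metric_statuses, default=Status.NORMAL)


if __name__ == "__main__":
    disk = metric_status(85.0, DISK_THRESHOLDS)
    print(disk.name, device_status([Status.NORMAL, disk]).name)
```

Changing the numbers in a table like `DISK_THRESHOLDS` is the kind of per-server adjustment described above.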
As this solution learns more about how the devices in our network are operating over time, we can run trend analysis that predicts when a given Metric will reach its defined Trigger values. For example, we can analyze the trend for a particular Storage Resource and estimate when that Resource will hit the Warning, Impaired or Critical trigger conditions set for it. We’ll consider this aspect of our solution active 90 days after our launch date (which is today).
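As a rough illustration of what such a prediction involves, the sketch below fits a straight line to past samples and estimates how long until it crosses a threshold. We are assuming a simple least-squares linear trend here purely for illustration; the actual product may use a different method, and the function name and sample data are made up.

```python
from typing import List, Optional, Tuple


def days_until_threshold(samples: List[Tuple[float, float]],
                         threshold: float) -> Optional[float]:
    """Estimate days from the last sample until the trend reaches threshold.

    samples is a list of (day, value) pairs in chronological order.
    Returns None when there is too little data or the trend is not rising.
    Assumption: a least-squares straight line is a good enough model.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples taken at the same time
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # flat or falling: the threshold is never reached
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    last_day = samples[-1][0]
    return max(0.0, crossing_day - last_day)


if __name__ == "__main__":
    # Made-up data: a disk filling by one percentage point per day.
    usage = [(0, 50.0), (1, 51.0), (2, 52.0), (3, 53.0)]
    print(days_until_threshold(usage, 80.0))
```

A prediction like this only becomes meaningful once enough history has built up, which is why a waiting period after launch makes sense.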
Recorded Metric values are written to disk using the open-source RRDtool file format. Unlike other monitoring systems, where the resolution of historical data is lost or truncated over time, our solution keeps track of every recorded sample from the moment it is activated. Data is stored in rolling file-per-year and file-per-month files, in a neatly arranged directory structure that follows the Customer, Site, Device, Container, Object and Metric hierarchy of monitored entities.
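To picture that layout, here is a sketch that builds the yearly and monthly file paths for one Metric. The root directory, the file naming pattern and the `.rrd` extension are assumptions for illustration; the text only tells us that the directories follow the entity hierarchy and that files roll per year and per month.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Tuple


def metric_files(root: str, customer: str, site: str, device: str,
                 container: str, obj: str, metric: str,
                 when: date) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (yearly_file, monthly_file) for a metric on a given date.

    The directory nesting mirrors Customer/Site/Device/Container/Object/Metric.
    File names such as "2024.rrd" and "2024-03.rrd" are hypothetical.
    """
    base = PurePosixPath(root, customer, site, device, container, obj, metric)
    yearly = base / f"{when.year}.rrd"
    monthly = base / f"{when.year}-{when.month:02d}.rrd"
    return yearly, monthly


if __name__ == "__main__":
    yearly, monthly = metric_files("/data", "acme", "main", "web01",
                                   "storage", "disk0", "percent_used",
                                   date(2024, 3, 15))
    print(yearly)
    print(monthly)
```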
All in all, we’re very excited about this solution. We strongly believe it will provide added value and functionality to our hosting and support services, and will benefit all of our clients. Note that these capabilities are being provided at no additional charge to any of our annually hosted customers.