Is there a better way to measure availability?

For the last quarter century, the standard way to measure the availability of a networked asset has been the ping or connectivity test.  If we get a response from a ping test, the asset is Up; if we don’t, it’s Down.  Simple as that.  But is this the best measurement of availability?

Consider this scenario.  A Linux server running Tomcat delivers the web interface of a critical business application.  For one reason or another, the Tomcat service stops responding to requests.  The OS still responds to pings.  From a purely technical perspective, the server is up.  But from a business perspective, it’s unable to perform its function.
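To make the distinction concrete, here is a minimal Python sketch.  It is not FireScope code: the function names are hypothetical, a TCP connect stands in for ping (true ICMP needs raw sockets and elevated privileges), and probing an HTTP URL is just one assumed way to test whether the application is doing its job.

```python
import http.client
import socket
import urllib.request


def host_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Connectivity-style check (stand-in for ping): can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_responding(url: str, timeout: float = 2.0) -> bool:
    """Functional check: does the application answer an HTTP request with a 2xx?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException, ValueError):
        # Covers refused connections, timeouts, HTTP error statuses,
        # malformed responses, and malformed URLs.
        return False


def availability(reachable: bool, responding: bool) -> str:
    """Classify the asset by whether it performs its role, not by whether it answers."""
    if responding:
        return "UP"
    if reachable:
        # The Tomcat scenario: the OS answers, the application does not.
        return "DOWN (host reachable, service not responding)"
    return "DOWN (host unreachable)"
```

In the Tomcat scenario, the connectivity check succeeds and the functional check fails, so `availability(True, False)` reports the server as down even though a ping-only status board would show it green.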

From a macro perspective, this can stymie troubleshooting efforts.  In the first seconds after IT becomes aware that a problem exists with the service, an operator will likely look at a status board of all of the assets that contribute to the impacted service, hoping to isolate the problem to a specific system.  If we follow the old approach to availability, the offending asset appears operational, so the search continues among what could be hundreds or thousands of systems, routers, switches and other components of the service.  Significant time is lost, simply because the status board was showing green.

 

A Better Approach to Availability

Sample configuration item with associated Availability Event Definition

Applying a one-size-fits-all methodology to measuring availability, such as ping tests, is doomed to fail from the beginning.  This is why FireScope has the option to configure any Event Definition (ED) of a Configuration Item (CI) as its Availability Event Definition.  In the screen shot to the left, you can see a Claims Management Database Server, which we associated with an Event Definition evaluating orders as its measure of availability.  With this configuration, we are measuring the server’s functional role in the service, which can greatly aid in troubleshooting incidents.  Another advantage of this approach is that downstream issues, such as hung queries or performance problems in the database application, will affect the results of this evaluation and still trigger alerts.  None of these issues would impact a ping test.

Now, you may be thinking this is a lot of work assigning specific availability events to each CI, but we’ve got this covered.  Availability Definitions can be assigned to Blueprints, which can then be mass associated with Configuration Items for bulk configuration.  And down the road, if you identify a more meaningful measure of availability, changing the Blueprint applies the change to all associated CIs.
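The reason a Blueprint change reaches every CI can be illustrated with a small Python sketch.  This is an assumed data model for illustration only, not FireScope’s implementation or API: the `Blueprint` and `ConfigurationItem` classes, the metric names, and the check functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class Blueprint:
    """Hypothetical template holding one shared availability definition."""
    name: str
    availability_check: Callable[[Metrics], bool]


@dataclass
class ConfigurationItem:
    """Hypothetical CI that looks up its availability rule on its Blueprint."""
    name: str
    blueprint: Blueprint
    metrics: Metrics = field(default_factory=dict)

    def is_available(self) -> bool:
        # The rule is read from the Blueprint at evaluation time, so a change
        # to the Blueprint applies to every associated CI.
        return self.blueprint.availability_check(self.metrics)


# Start with the old approach: availability means "answers ping".
web = Blueprint("tomcat-web", availability_check=lambda m: m.get("ping_ok") == 1.0)
web01 = ConfigurationItem("web01", web, {"ping_ok": 1.0, "http_ok": 0.0})
web02 = ConfigurationItem("web02", web, {"ping_ok": 1.0, "http_ok": 1.0})
before = [ci.is_available() for ci in (web01, web02)]

# Later we identify a more meaningful measure and change only the Blueprint.
web.availability_check = lambda m: m.get("http_ok") == 1.0
after = [ci.is_available() for ci in (web01, web02)]
```

Here `before` is `[True, True]` and `after` is `[False, True]`: web01, whose application has stopped answering, turns red once the Blueprint measures function rather than connectivity, and neither CI had to be reconfigured.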

 

Seconds Matter

Every second counts when troubleshooting outages of business critical services, and the faster we can lift relevant intelligence to our users above the noise of innocuous events that are common in every data center, the faster our customers can identify, respond to and resolve issues.  As you can see in the sample dashboard on the left, each CI’s availability is measured by how it contributes to its dependent IT service.  If we were to revisit the scenario mentioned in the beginning, where Tomcat fails on a critical web server, its status in a FireScope dashboard would immediately show red, helping operators quickly isolate the source of the problem and shaving minutes or even hours from the time it takes to restore normal operating conditions.

Sample Availability Dashboard in FireScope

When we start thinking of IT operations from a service perspective, this necessitates re-thinking how we measure availability.  The ability of assets to perform their role in the service is far more critical than whether or not the asset is functionally online, yet for the last quarter century our monitoring tools have consistently taken the easy way out and measured availability in an inflexible and highly inaccurate way.  Our approach, which empowers our customers to choose how they wish to define availability, has been a powerful differentiator for FireScope Stratis that further shows just how extensively we have thought about a better approach to service monitoring and management.
