Raising the alarm

The utility control room and the IT service desk have much to learn from each other. Dave Stow of United Utilities and John Brown of WRc explain how the water industry could benefit

In the central control room of a large water utility, operators manage alarms from across a large number and wide variety of assets. They use processes that have evolved over the years, and which appear to be remarkably similar to those used by your IT service desk when you call to say your printer is not working.

Indeed, the IT service desk and the control room of a water utility are more similar than they might first appear, so let us briefly examine what lessons they can learn from each other.

Almost all IT service desks use business processes that are part of the Information Technology Infrastructure Library (ITIL). This was originally developed in the 1980s as 44 volumes of dryly worded best practice, which languished almost unused for years until condensed into the seven-volume ITIL v2 in 1998. ITIL is now the most widely accepted approach to IT service management in the world.

One of the most visible manifestations of ITIL is the service desk where users of IT can report that something is not working. ITIL clearly describes how a notification of an issue should be dealt with. Not so the literature on managing alarms.

Although there is much written about alarm systems (for example EEMUA 191, which is widely recognised as the leading benchmark of best practice in industrial alarm systems), the focus is almost exclusively upon how alarms should be detected and presented to the operator.

Very little is said about what the operator should do in response to an alarm, particularly if they are in a central control room handling alarms from a vast, distributed network of assets.

This has led to utility businesses developing their own processes for managing the response to an alarm, and those processes have evolved to be almost identical to the service desk Incident Management process. Here we briefly examine this process to see what best practice can be taken from it.

In the service desk, a user calls in with an issue, or Incident, such as “my printer isn’t working”, and this is categorised and prioritised. In the control room, telemetry raises a prioritised alarm, such as “pump tripped”. Both the service desk and control room operators carry out some level of initial diagnosis of what might be wrong. If the issue cannot be corrected remotely (or dismissed as a false alarm) then both will escalate it in one or both of two directions:

  • Functionally: someone is found who can go and correct whatever is wrong
  • Hierarchically: someone up the management chain is informed, either to make them aware that something bad has happened (for the control room, “We’ve lost output from a major works”; for the service desk, “All the internet service is out of action!”) or to gain access to a resource needed to resolve whatever is wrong (“Fred’s not answering his phone (again)!”).

Eventually the right resource is found, and they will resolve the issue, sometimes by using a known workaround until a proper fix can be effected and service is restored.

The final step is a “close” operation, where the actions taken to resolve the issue are captured, service restoration is confirmed and the incident is closed.

This step is key to “closing the loop” on an alarm’s life-cycle. It ensures that the work done really has cleared the alarm, and that the steps taken are not lost: it allows lessons to be learnt which can be used to improve the asset.
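The shared process described above can be sketched as a small state machine. This is a minimal illustration only: the state names, the transition table and the `Incident` class are assumptions made for this example and are not taken from ITIL or from any alarm management product.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RAISED = "raised"
    DIAGNOSED = "diagnosed"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Hypothetical transition table; adapt it to the locally documented process.
ALLOWED = {
    State.RAISED: {State.DIAGNOSED},
    State.DIAGNOSED: {State.ESCALATED, State.RESOLVED},  # handed on, or fixed remotely
    State.ESCALATED: {State.ESCALATED, State.RESOLVED},  # may escalate more than once
    State.RESOLVED: {State.CLOSED},
    State.CLOSED: set(),
}


@dataclass
class Incident:
    description: str
    priority: int
    state: State = State.RAISED
    actions: list = field(default_factory=list)  # record of what was done

    def move(self, new_state, note):
        """Change state if the transition is allowed, and record the action."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        self.actions.append(note)

    def close(self, service_restored):
        """Close only once service restoration has been confirmed."""
        if not service_restored:
            raise ValueError("cannot close: service restoration not confirmed")
        self.move(State.CLOSED, "service restoration confirmed")


alarm = Incident("pump tripped", priority=1)
alarm.move(State.DIAGNOSED, "remote reset failed")
alarm.move(State.ESCALATED, "functional: technician sent to site")
alarm.move(State.RESOLVED, "workaround: standby pump running")
alarm.close(service_restored=True)
```

In this sketch an alarm cannot reach the closed state without passing through resolution and a confirmation that service is restored, and every step leaves a note in `actions`, which is the record the close step is meant to preserve.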

Though the close step usually appears in the documented alarm response process, in a busy control room where the priority is the next alarm, it can be an afterthought.

Looking deeper into the ITIL processes for the service desk, there is another process mandated called Problem Management.

A Problem is an “unknown cause of one or more incidents”. It may be thought of as being the root cause of one or more alarms.

The key difference is that while Incident Management makes the alarm/incident go away, Problem Management identifies why the alarm occurred in the first place and corrects the underlying root cause.

ITIL states that Problem Management is both reactive and proactive. In proactive Problem Management the history of failures is reviewed carefully to identify underlying causes that should be rectified. Is proactive Problem Management formally pursued in the water industry? Any alarm reduction programme will have some implied aim of correcting the underlying causes of the alarms; however, it is the contention of the authors that this process should be made more formal.

An alarm reduction programme should avoid merely trying to cut alarm numbers, and should instead focus on a proactive search for the causes of those alarms. Reactive Problem Management is where the service desk creates a Problem and assigns Incidents to it, allowing the Problem to be managed as a single entity.

This functionality is universally available in service desk software, and raises some ideas that should be considered for control rooms and the alarm management software used there. Where a failure causes a large number of alarms (for example, a failure in an early stage of a treatment works), the operators should be able to create a Problem and group all the consequential alarms under it, allowing them to manage the Problem as a single entity, passing it to a resource who can correct this underlying cause.
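As a rough sketch of what such grouping might look like in software, the example below attaches several consequential alarms to one Problem that can then be assigned and tracked as a unit. The `Alarm` and `Problem` classes, the tag names and the assignee are all hypothetical and do not describe any particular service desk or alarm management package.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alarm:
    tag: str
    active: bool = True
    problem_id: Optional[int] = None  # set when grouped under a Problem


@dataclass
class Problem:
    problem_id: int
    summary: str
    assignee: str = ""
    alarms: list = field(default_factory=list)

    def attach(self, alarm):
        """Group an alarm under this Problem; the alarm itself is kept."""
        alarm.problem_id = self.problem_id
        self.alarms.append(alarm)

    def assign(self, who):
        """Pass the whole Problem to one resource."""
        self.assignee = who

    def outstanding(self):
        """Tags of grouped alarms that are still active."""
        return [a.tag for a in self.alarms if a.active]


# Hypothetical example: a failure early in a treatment works.
alarms = [Alarm("inlet.flow.low"), Alarm("filter1.level.low"), Alarm("outlet.flow.low")]
problem = Problem(1, "inlet failure at treatment works")
for a in alarms:
    problem.attach(a)
problem.assign("duty engineer")

alarms[0].active = False  # the first alarm clears
remaining = problem.outstanding()
```

The operator deals with one object, the Problem, while each underlying alarm stays visible and can be checked individually as it clears.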

Something that ITIL does not discuss is that a known root cause such as wet weather or power failure can trigger alarm floods in less well developed assets.

The proactive Problem Management discussed above will pick these up as targets for investment, but where we know that a specific root cause will trigger these alarm floods, then it should be possible to use a form of reactive problem management to automatically group the alarms under a problem.

This is preferable to using consequential suppression: the alarms remain in existence and accessible, but are easily managed as a single problem. Once the root cause is corrected, it is possible to confirm that the alarms have cleared in a timely manner.
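One way to picture this automatic grouping is a simple rule table that maps each known root cause to the alarms it is known to trigger. The sketch below is an assumption-laden illustration: the rule table, the tag names, the 30-minute grace period and both functions are invented for this example.

```python
from datetime import datetime, timedelta

# Hypothetical rule table: known root cause -> alarm tags it is known to trigger.
KNOWN_CAUSES = {
    "power_failure": {"PS12.pump1.trip", "PS12.pump2.trip", "PS12.comms.lost"},
    "wet_weather": {"CSO7.level.high", "PS12.wetwell.high"},
}


def group_alarms(active_causes, alarm_tags):
    """Return ({cause: [tags]}, [ungrouped tags]). Nothing is suppressed."""
    grouped = {cause: [] for cause in active_causes}
    ungrouped = []
    for tag in alarm_tags:
        for cause in active_causes:
            if tag in KNOWN_CAUSES.get(cause, set()):
                grouped[cause].append(tag)
                break
        else:
            # No active root cause explains this alarm; handle it normally.
            ungrouped.append(tag)
    return grouped, ungrouped


def overdue(grouped_tags, still_active, corrected_at, now,
            grace=timedelta(minutes=30)):
    """Grouped alarms still active after the grace period has passed."""
    if now - corrected_at < grace:
        return []
    return [t for t in grouped_tags if t in still_active]


grouped, ungrouped = group_alarms(
    ["power_failure"],
    ["PS12.pump1.trip", "CSO7.level.high", "PS12.comms.lost"],
)

corrected = datetime(2024, 1, 1, 12, 0)
late = overdue(
    grouped["power_failure"],
    still_active={"PS12.comms.lost"},
    corrected_at=corrected,
    now=corrected + timedelta(minutes=45),
)
```

Unlike suppression, every alarm remains in one of the two returned collections, and the `overdue` check gives a simple way to confirm that grouped alarms did clear once the root cause was corrected.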

Having discussed the possibilities of the control room learning from ITIL, is there anything ITIL can learn from the water industry? The authors believe so. There is an Event Management process where IT equipment is instrumented and alarms are generated.

From a water industry viewpoint this looks much like applying instrumentation and SCADA to a treatment process, but the ITIL alarm generation process is lacking. Here the process control industry has much to offer, and the good work developed in EEMUA 191 is directly applicable to this ITIL Event Management process, and ITIL should look to it in order to improve this process.

ITIL and the utility industries each have something to learn from the other. ITIL should look to improve its monitoring of equipment as touched on above, and the utilities can take advantage of the good practices that have long been established in the IT service desk by having them cross-pollinated into the control room and the alarm management processes and software used there.

Finally, though this article has focused on control room alarm systems and the ITIL Incident and Problem Management processes, it is felt that there are other synergies that are worthy of investigation, for example the Configuration Management Database and the Change and Release Management processes in ITIL.

There is great experience in the process control world that is applicable to ITIL, and ITIL too has something to teach the control room.
