A fresh look at network investment priorities

Tue, 18th Sep 2012

There are typically two costs to consider for both security operations and network operations: the capital expense of purchasing tools and the operational expenses of employees required to drive the tools.

Unfortunately, finance teams often fail to recognise that the number of ‘incidents’ that NetOps and SecOps teams are being asked to deal with is increasing while operational budgets remain constant, if not decreasing. This situation is at best troublesome, at worst a train wreck waiting to happen.

To put the situation into context, look at the tool investments that organisations are making today and the impact those investments have on operational functions.

The tools that ops teams use can be broken down into two categories:

1. Tools that PROTECT organisations from bad stuff getting into the network, such as firewalls, anti-spam, IPS and access control systems. These are good for removing harmful stuff without anyone having to manage them on a day-to-day basis. Their downfall is that they only catch threats that are already known: the ‘known-knowns’.

2. Tools that DETECT bad stuff in the network, including IDS, SIM, advanced malware detection and DDoS detection. These technologies are important because they start to bridge the gap between the ‘known-knowns’ and the ‘known-unknowns’.

From an operational standpoint, there's a gotcha with the DETECT category. For every tool that gets deployed, there's operational overhead required to manage the tool's output.

Organisations can't continually add capability without expanding the operational footprint, which they typically don't want to do. The risk of having the two out of balance is that, while the new tools DETECT really important stuff, there's nobody to deal with the alarms. The analysts are still busy trying to work out whether the last alarm generated by the last tool was real or not.

The answer is not to reduce the number of DETECTION tools in use, but rather to improve the efficiency of the engineers and analysts responsible for dealing with alarms. If the number of incidents each engineer can resolve per hour can be increased faster than the number of problems grows, then the status quo can be maintained, but it requires a different capital investment profile from the one that most organisations are using today.
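To make the throughput argument concrete, here is a minimal sketch of a month-by-month backlog model. Every figure in it (incident volume, growth rates, team size) is a hypothetical chosen for illustration, not data from this article.

```python
def backlog_over_time(incidents_per_month, incident_growth, analysts,
                      resolved_per_analyst, throughput_growth, months):
    """Return the unresolved-incident backlog at the end of each month."""
    backlog = 0.0
    history = []
    for _ in range(months):
        # Team capacity this month: headcount stays fixed, as budgets are flat.
        capacity = analysts * resolved_per_analyst
        backlog = max(0.0, backlog + incidents_per_month - capacity)
        history.append(round(backlog, 1))
        # Incident volume and per-analyst throughput both compound monthly.
        incidents_per_month *= 1 + incident_growth
        resolved_per_analyst *= 1 + throughput_growth
    return history


# Hypothetical: 400 incidents/month growing 5% a month, handled by
# 5 analysts who each resolve 80 incidents/month.
flat = backlog_over_time(400, 0.05, 5, 80, 0.00, 12)       # no efficiency gain
improving = backlog_over_time(400, 0.05, 5, 80, 0.06, 12)  # throughput up 6% a month

print("Backlog, flat throughput:     ", flat)
print("Backlog, improving throughput:", improving)
```

Under these assumptions the backlog grows every month when throughput is flat, and stays at zero when per-analyst throughput grows slightly faster than incident volume, with no change in headcount.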

Do more with less

Analysts need a tool that does more with less; one that dramatically shortens the time it takes to figure out whether the alarm they are looking at is serious and whether there's any action required. A new technology category is emerging to complement the PROTECT and DETECT categories that's designed specifically to help improve analyst throughput by focusing on the RESPONSE - ROOT CAUSE ANALYSIS workflow. Underpinning this new technology category is network recording or full packet capture.

Right now, the IT investment profile for most organisations is 70% PROTECTION, 30% DETECTION. In the future, that investment profile needs to shift to 60% PROTECTION, 20% DETECTION and 20% RESPONSE - ROOT CAUSE. The good news is that the cost of RESPONSE - ROOT CAUSE infrastructure can be shared by NetOps and SecOps, as the core functionality is agnostic to which team uses it.
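As a rough illustration of what that shift means in money terms, the sketch below applies both profiles to a tools budget. The percentages are the ones quoted above; the budget figure of 1,000,000 is a hypothetical placeholder.

```python
CURRENT_PROFILE = {"PROTECTION": 70, "DETECTION": 30, "RESPONSE - ROOT CAUSE": 0}
TARGET_PROFILE = {"PROTECTION": 60, "DETECTION": 20, "RESPONSE - ROOT CAUSE": 20}


def allocate(budget, profile):
    """Split a whole-number budget across categories by percentage."""
    if sum(profile.values()) != 100:
        raise ValueError("profile percentages must sum to 100")
    return {category: budget * pct // 100 for category, pct in profile.items()}


budget = 1_000_000  # hypothetical annual tools budget
current = allocate(budget, CURRENT_PROFILE)
target = allocate(budget, TARGET_PROFILE)

# Positive numbers are new spend; negative numbers are spend moved elsewhere.
shift = {category: target[category] - current[category] for category in target}
for category, delta in shift.items():
    print(f"{category}: {delta:+,}")
```

The total stays the same. One fifth of the budget moves into RESPONSE - ROOT CAUSE, funded equally from PROTECTION and DETECTION.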

The key to effective response and root cause analysis is accurate historical network visibility, which means network recording. If analysts can go back in time easily and quickly to see, at packet level, exactly what happened at the time an alarm was generated, then they can determine quickly whether it's real, what happened and what to do about it.

Change the metric

Time-to-Resolution (TTR) is a function of historical network visibility – the more visibility that analysts have, the more work they can chew through in any given period of time. When the number of incidents is going up and the number of available analyst hours is going down, getting a handle on this is important.

Most organisations focus on Mean-Time-To-Resolution (MTTR). But the reality is that there's very little that can be done to move the needle on MTTR. Reducing the time it takes to fix the average problem from four hours to three hours 50 minutes is irrelevant. A more interesting metric to look at is MAX-TTR or the MAXIMUM time it takes to solve a problem. That's where a real impact can be made relatively easily.

For most organisations, TTR follows a standard distribution curve where the majority of incidents are clustered around the 4-hour mark. However, there's nearly always a long tail of events that take days or even weeks to deal with, and consume large amounts of scarce operational resources. The ability to go back in time to the point that a particular issue was reported or alarmed and identify exactly what happened enables organisations to dramatically reduce the time to resolution on those long-tail issues.

Dropping MAX-TTR from 24 hours to four hours (or less) will have a profound impact on the resources available to deal with problems. Also, by having true visibility, the quality of the remedies that engineers and analysts put in place can be dramatically improved. Root causes can be addressed, rather than just symptoms.
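The difference between trimming the average and cutting the long tail can be checked with a few lines of arithmetic. The sketch below uses a small, made-up sample of resolution times: eight incidents near the four-hour mark and two that take 24 hours. It compares the analyst hours freed by shaving 10 minutes off every incident with those freed by capping every incident at four hours.

```python
from statistics import mean, median

# Hypothetical resolution times in hours: most cluster around 4,
# with two long-tail incidents at 24 hours.
ttr_hours = [3, 3.5, 4, 4, 4, 4, 4.5, 5, 24, 24]


def summarise(ttrs):
    """Basic TTR statistics for a list of resolution times (hours)."""
    return {"mean": mean(ttrs), "median": median(ttrs),
            "max": max(ttrs), "total": sum(ttrs)}


def cap_ttr(ttrs, cap_hours):
    """Model a lower MAX-TTR: no incident takes longer than cap_hours."""
    return [min(t, cap_hours) for t in ttrs]


before = summarise(ttr_hours)
after = summarise(cap_ttr(ttr_hours, 4))

# Option A: every incident gets 10 minutes faster.
hours_freed_by_shaving = len(ttr_hours) * 10 / 60
# Option B: MAX-TTR drops from 24 hours to 4 hours.
hours_freed_by_cap = before["total"] - after["total"]

print(f"Before: {before}")
print(f"After cap: {after}")
print(f"Freed by shaving 10 min per incident: {hours_freed_by_shaving:.1f} h")
print(f"Freed by capping MAX-TTR at 4 h: {hours_freed_by_cap:.1f} h")
```

With these illustrative numbers, shaving 10 minutes off every incident frees under two analyst hours, while capping MAX-TTR at four hours frees 41.5. Real figures will depend on each organisation's own TTR distribution.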

As soon as Capex spending on tools and Opex spending on heads get out of sync, the operational model starts to break down and organisations are exposed to unacceptable levels of corporate risk. The next wave of technology investment cannot be DETECTION-related, because organisations must invest in tools that improve the efficiency of analyst response and so increase throughput per analyst-hour.

Understanding your distribution of fault-resolution times highlights the opportunity to unlock significant savings in the long tail, savings that could enable existing resource levels to be maintained, at least for a little while.

By: Tim Nichols, VP Marketing, Endace