Detection Engineering and SOC Scalability Challenges (Part 2)

This blog series was written jointly with Amine Besson, Principal Cyber Engineer, Behemoth CyberDefence and one more anonymous collaborator.

This post is our second installment in the “Threats into Detections — The DNA of Detection Engineering” series, where we explore the challenges of detection engineering in more detail — and the role threat intelligence plays (some hope appears here too … but you need to wait for Part 3 for that!)

Detection Engineering is Painful — and It Shouldn’t Be (Part 1)

Contrary to what some may think, success in detection and response (D&R) is more about processes and people than about the SIEM. As one of the authors used to say during his tenure at Gartner, “SOC is first a team, then a process and finally a technology stack” (and he just repeated this at mWISE 2023). And here is another: “A great team with an average SIEM will run circles around an average team with a great SIEM.”

SIEMs, or whatever equivalent term you may prefer (a security data lake perhaps? But please no XDR… we are civilized people here), are essentially large-scale telemetry analysis engines, running detection content over data stores and streams of data. The signals they produce are often voluminous without on-site tuning and context, and won’t deliver value in isolation, without the necessary process stack.

It is the cyber defenders’ knowledge, injected at every step of rule creation and alert (and then incident) response, that is the real value-add of a SOC capability. Note that some of the rules/content may be created by the tool vendor while the rest is created by the customer.

So, yes, process is very important here, yet under the shiny new name of TDIR (Threat Detection and Incident Response) lies essentially a creaky process stack riddled with inefficiencies and toil:

  • Inconsistent internal documentation — and this is putting it generously, enough SOC teams run on tribal knowledge, and even an internal Wiki would be a huge improvement for them
  • Staggered and chaotic project management — SOC project management that is hard to understand and improve, with an irregular release/delivery process; traceability is often lost midway through the operational noise.
  • No blueprint to do things consistently — before we talk automation, let’s talk consistency. And this is hard with ad hoc process that is reinvented every time…
  • No automation to do things consistently and quickly — once the process is clear, how do we automate it? The answer is often “well, we don’t”; see the item just above…
  • Long onboarding of new log sources — while the 1990s are over, the organizations where a SOC needs to push paper forms deep inside some beastly IT bureaucracy to enable a new log source have not vanished yet.
  • Low awareness of removed or failed log sources — such SOCs risk missing critical security events and failed — worse, quietly failed — detections.
  • Large inertia to develop new detection content, low agility — if you turn an annual process into a quarterly one, but what you need is a daily cadence, have you actually improved things?
  • Inscrutable and unmaintainable detection content — if the detection was not developed in a structured and meaningful way, then both alert triage and further refinement of detection code will … ahem … suffer (this wins the Understatement of the Year award).
  • Technical bias, starting from available data rather than threats — this is sadly very common at less-mature SOCs. “What data do we collect?” tends to precede “what do we actually want to do?” despite the “output-driven SIEM” concept having been invented before 2012 (to be honest, I stole the idea from a Vigilant consultant back in 2012).
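To make the “quietly failed log source” item above concrete, here is a minimal heartbeat-check sketch. The source names, silence thresholds, and the shape of the last-seen data are all illustrative assumptions, not any specific SIEM’s API:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-source silence thresholds (tune per environment).
MAX_SILENCE = {
    "windows_security": timedelta(minutes=15),
    "firewall": timedelta(minutes=5),
    "vpn": timedelta(hours=1),
}

def stale_sources(last_event_times, now=None):
    """Return log sources whose last event is older than the allowed
    silence window -- likely removed or quietly failed sources."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, last_seen in last_event_times.items()
        if now - last_seen > MAX_SILENCE.get(name, timedelta(minutes=30))
    )

# Example: the firewall feed went quiet 20 minutes ago, the rest are fresh.
now = datetime(2023, 9, 21, 12, 0, tzinfo=timezone.utc)
observed = {
    "windows_security": now - timedelta(minutes=2),
    "firewall": now - timedelta(minutes=20),
    "vpn": now - timedelta(minutes=10),
}
print(stale_sources(observed, now))  # -> ['firewall']
```

Even a trivial check like this, run on a schedule, turns a silent detection gap into an actionable alert for the SOC.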

While IT around your SOC may live in the “future” world of SRE, DevOps, GitOps and large scale automation, releasing new detections to the live environment is, surprisingly, often heavy on humans, full of toil and friction.

Not only is it often lacking sophistication (copy-pasting from a sheet into a GUI), but it is also not tracked or versioned in many cases — which makes ongoing improvement challenging at best.

Some teams have made good progress toward automation by using detection-as-code, but adoption is still minimal. And apart from a handful of truly leading teams, it is often limited to deploying vendor-provided rules or code from public repositories (ahem, “detection as code written by strangers on the internet”, if you’d like…). This in turn poses a real challenge of reconciling internal and external rule tracking.
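To sketch what detection-as-code buys you: rules live in version control and a lint step in CI blocks unmaintainable content from reaching production. The required fields and the rule structure below are illustrative assumptions, not any vendor’s schema:

```python
# Metadata every versioned rule must carry before it may be deployed
# (hypothetical policy, chosen for illustration).
REQUIRED_FIELDS = {"id", "title", "logsource", "query", "tags", "owner"}

def lint_rule(rule: dict) -> list:
    """Return a list of problems; an empty list means the rule may ship."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - rule.keys())]
    if not any(t.startswith("attack.") for t in rule.get("tags", [])):
        problems.append("no ATT&CK tag: coverage cannot be measured")
    return problems

# A rule missing its owner field fails the CI gate:
rule = {
    "id": "win-encoded-ps-001",
    "title": "Encoded PowerShell command line",
    "logsource": "windows_security",
    "query": 'process.name = "powershell.exe" AND cmdline CONTAINS "-enc"',
    "tags": ["attack.t1059.001"],
}
print(lint_rule(rule))  # -> ['missing field: owner']
```

The point is not the specific checks but the mechanism: every rule change goes through review and validation, so triage analysts inherit documented, attributable detections instead of mystery GUI artifacts.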

An astute reader will also point out that the very process of “machining” raw threat signals into polished production detections is very artisanal in most cases; but don’t despair, we will address this in the next parts of this series! It will be fun!

Apart from that, much of the process of creating new detections has two key problems:

  • Often it starts from available data, and not from relevant threats.
  • Prioritization is still very much a gut feeling affair based on assumption, individual perspective and analysis bias.

Instead, there should be a rolling evaluation of relevant and incoming threats, crossed with current capabilities. In other words, measuring detection coverage (how well we detect in our environment against the overall known threat landscape) allows us to build a rolling backlog of threats to detect, and to identify logging/telemetry gaps and key improvement points to steer detection content development. This will turn an arts-and-crafts detection project into an industrial detection pipeline.
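The rolling evaluation described above can be sketched as a simple cross of threat-intel priorities with current detections, keyed on ATT&CK technique IDs. The techniques, priority weights, and coverage set below are made up for illustration:

```python
# Hypothetical threat landscape: technique -> priority from threat intel.
threat_landscape = {
    "T1059.001": 9,  # PowerShell
    "T1566.001": 8,  # Spearphishing attachment
    "T1003.001": 7,  # LSASS memory dumping
    "T1021.001": 5,  # RDP lateral movement
}

# Techniques our production rules currently cover (illustrative).
detections = {"T1059.001", "T1021.001"}

def coverage_and_backlog(landscape, covered):
    """Coverage ratio over the known landscape, plus a priority-ordered
    backlog of undetected techniques to steer content development."""
    backlog = sorted(
        (t for t in landscape if t not in covered),
        key=lambda t: -landscape[t],
    )
    ratio = len(covered & landscape.keys()) / len(landscape)
    return ratio, backlog

ratio, backlog = coverage_and_backlog(threat_landscape, detections)
print(f"{ratio:.0%}", backlog)  # -> 50% ['T1566.001', 'T1003.001']
```

Re-run whenever threat intel or the rule set changes, this gives the rolling, prioritized backlog the text calls for, rather than a one-off coverage snapshot.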

21 September 2023