Automating Hunt Hypotheses with an Agent Swarm

I read a lot of threat intel, and most of what comes through a feed is news, not something you can hunt. A report tells you an actor abused a legitimate RMM tool, or a CVE is under active exploitation. Good to know, but it isn't a hunt. The work that's left is the part that matters: given telemetry from a customer environment nobody has baselined, what specific behavior do I go look for, and how do I tell it apart from the thousand benign things that look almost the same?

I built a multi-agent pipeline to take the first pass at that. Every morning it pulls the current reporting off my CTI dashboard, splits the reading across a few specialized agents, scores what comes back against an MDR-feasibility rubric, and posts the strongest hunts to a rolling weekly board. It runs on a schedule and deploys its own output, and I review what it ships.

How the Pieces Fit

It's a scheduled job: once a day it boots, makes one pass from collection to publish, and exits. Collection pulls the current reporting and turns it into one source list every agent reads from. Three scouts work in parallel. A deterministic harness merges what they hand back, scores it, and checks that every claim traces to a real source. Then a quality gate throws out anything that doesn't meet the bar, and whatever is left gets rewritten in my voice and merged into the board.

The Scouts and the Harness

There are three scouts instead of one analyst because focus beats breadth. A single agent asked to read eighty mixed reports drifts toward whatever is loudest that day, so I split the work by surface: one scout for cloud and identity, one for endpoint, one for supply chain. They run at the same time and each returns its two or three strongest candidates with the source URLs behind them. Every URL has to exist verbatim in the bundle the scout was handed, and the harness checks that on merge, so the model never gets to make up an actor or a CVE and have it sail through.
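A minimal sketch of the parallel fan-out and the verbatim URL check might look like the following. It is an illustration under assumptions, not the pipeline's actual code: `scout_fn`, the surface identifiers, and the `title` / `source_urls` field names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Surface names mirror the three scouts; the identifiers are illustrative.
SURFACES = ("cloud_identity", "endpoint", "supply_chain")


def run_scouts(scout_fn, bundle_urls, surfaces=SURFACES):
    """Run one scout per surface concurrently and collect their candidates.

    scout_fn(surface, bundle_urls) is a hypothetical callable that wraps the
    model call and returns a list of dicts with "title" and "source_urls".
    """
    with ThreadPoolExecutor(max_workers=len(surfaces)) as pool:
        futures = {s: pool.submit(scout_fn, s, bundle_urls) for s in surfaces}
        return {s: f.result() for s, f in futures.items()}


def split_by_provenance(candidates, bundle_urls):
    """Separate candidates whose every URL is verbatim in the bundle."""
    allowed = set(bundle_urls)
    kept, rejected = [], []
    for cand in candidates:
        urls = cand.get("source_urls", [])
        # Exact string match only: no normalization, so a trailing slash fails.
        if urls and all(u in allowed for u in urls):
            kept.append(cand)
        else:
            rejected.append(cand)
    return kept, rejected
```

Matching on the exact string is deliberately strict. A URL the model "tidied up" is treated the same as one it invented, which keeps the check simple and leaves no room for near misses.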

The scouts are the only part that uses a model. The harness around them is plain code that merges the candidates, scores them, penalizes anything that keeps resurfacing day after day, and validates the batch against a strict schema. The model does the reading and the writing, but it doesn't decide what happens with its own output.
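As a rough sketch of what that deterministic layer could look like: validate the batch against a fixed schema, then discount anything that has been resurfacing. The schema fields, the per-day penalty of 0.15, and keying repeats on the title are assumptions chosen for illustration; `base_score` stands in for whatever the feasibility rubric produces.

```python
# Hypothetical schema: exactly these fields, with these types.
REQUIRED_FIELDS = {"title": str, "surface": str, "base_score": float, "source_urls": list}


def validate_batch(batch):
    """Raise ValueError if any record has missing, extra, or mistyped fields."""
    for i, rec in enumerate(batch):
        if set(rec) != set(REQUIRED_FIELDS):
            raise ValueError(f"record {i}: fields {sorted(rec)} do not match schema")
        for name, expected in REQUIRED_FIELDS.items():
            if not isinstance(rec[name], expected):
                raise ValueError(f"record {i}: {name} is not {expected.__name__}")


def rank(batch, days_seen, penalty_per_day=0.15):
    """Score each candidate, discounting topics that keep resurfacing.

    days_seen maps a title to how many recent days it already appeared.
    """
    validate_batch(batch)
    scored = []
    for rec in batch:
        repeats = days_seen.get(rec["title"], 0)
        # Linear discount per repeated day, floored at zero.
        factor = max(0.0, 1.0 - penalty_per_day * repeats)
        scored.append({**rec, "score": rec["base_score"] * factor})
    return sorted(scored, key=lambda r: r["score"], reverse=True)
```

Nothing in this layer calls a model, so the same batch always produces the same ranking.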

The Quality Bar

This is the rule that does most of the work. A hunt is only worth publishing if it points at specific tradecraft and works for an analyst who doesn't know the customer's environment. That second half is what usually gets skipped. An MDR sees a lot of tenants and doesn't know which ones sanction a given RMM tool or run a particular scheduled task on purpose, so a hunt that depends on knowing the baseline just fires on every tenant that legitimately does the thing.

Early on the swarm handed me a candidate: hunt for newly installed NinjaOne agents from phishing paths and rule out the approved rollout windows. It reads fine until you remember that plenty of shops deploy NinjaOne on purpose and an MDR has no idea which installs were approved. That hunt pushes the real decision onto baseline knowledge the analyst doesn't have. The same reporting becomes a real hunt once the discriminator lives in the logic instead: an RMM agent whose parent is a mail client or browser within minutes of a link click, that then spawns an interactive shell on a single host and beacons to an instance not seen anywhere else in the fleet. That chain is visible without knowing the baseline, and a scheduled rollout looks nothing like it.
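To make the chain concrete, here is a sketch of it as a predicate over a single install event. The event fields, the process-name sets, the ten-minute click window, and the fleet count lookup are all assumptions for illustration; real telemetry would need its own field mapping.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative process names; a real hunt would use the platform's own list.
MAIL_OR_BROWSER = {"outlook.exe", "chrome.exe", "msedge.exe", "firefox.exe"}
INTERACTIVE_SHELLS = {"cmd.exe", "powershell.exe", "pwsh.exe"}


@dataclass
class RmmInstall:
    # Hypothetical event shape, not a real product schema.
    host: str
    parent_process: str
    seconds_since_link_click: Optional[float]  # None if no click was observed
    child_processes: List[str]
    rmm_instance: str  # the RMM instance the agent beacons to


def matches_chain(ev, hosts_per_instance, max_click_gap_s=600.0):
    """True only when every link in the chain is present on this event.

    hosts_per_instance maps an RMM instance to how many fleet hosts talk to it.
    """
    lineage = ev.parent_process.lower() in MAIL_OR_BROWSER
    recent_click = (
        ev.seconds_since_link_click is not None
        and ev.seconds_since_link_click <= max_click_gap_s
    )
    shell = any(c.lower() in INTERACTIVE_SHELLS for c in ev.child_processes)
    # Fleet outlier: no host other than this one beacons to the instance.
    outlier = hosts_per_instance.get(ev.rmm_instance, 0) <= 1
    return lineage and recent_click and shell and outlier
```

None of the four conditions asks whether the tool is approved. A sanctioned rollout fails on lineage and on the fleet count without anyone consulting a baseline.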

So the gate bans a set of shapes outright: "first seen" or "new install" of a tool as the core signal, "is this tool authorized here," "rule out the approved rollouts," "watch for suspicious logins." What it requires instead is a discriminator the analyst can see blind: delivery lineage, process ancestry, fleet-outlier infrastructure, or a concrete named artifact from the source like a file path, registry key, or control-plane API call. A hunt that can't carry its own discriminator gets rejected, and a thin source that can't support one gets dropped rather than dressed up.
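A gate along these lines could be sketched as a phrase filter plus a required-discriminator check. This is a simplified illustration: the regular expressions, the `logic` and `discriminators` field names, and the category labels are assumptions, and keyword matching this crude would need tuning to avoid rejecting hunts that only mention a banned phrase in passing.

```python
import re

# Banned shapes, approximated as phrases in the hunt's logic text.
BANNED_PATTERNS = [
    re.compile(r"\bfirst[- ]seen\b", re.I),
    re.compile(r"\bnew(ly)?[- ]install(ed)?\b", re.I),
    re.compile(r"\bauthori[sz]ed\b", re.I),
    re.compile(r"\bapproved rollout", re.I),
    re.compile(r"\bsuspicious logins?\b", re.I),
]

# Discriminators an analyst can see without knowing the tenant's baseline.
ALLOWED_DISCRIMINATORS = {
    "delivery_lineage",
    "process_ancestry",
    "fleet_outlier_infrastructure",
    "named_artifact",
}


def gate(hunt):
    """Return (passed, reason) for one candidate hunt."""
    text = hunt.get("logic", "")
    for pattern in BANNED_PATTERNS:
        if pattern.search(text):
            return False, f"banned shape: {pattern.pattern}"
    if not set(hunt.get("discriminators", [])) & ALLOWED_DISCRIMINATORS:
        return False, "no blind-visible discriminator"
    return True, "ok"
```

Returning the reason alongside the verdict means every rejection can be logged and reviewed later.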

The Weekly Board

The output isn't a fresh card every day. A daily card forces the swarm to publish something whether the day produced anything worth hunting or not, and it throws away yesterday's good hunt to make room for today's mediocre one. So what gets published is a rolling board of the strongest hunts from the last seven days, best at the top. Each morning's run merges into it rather than replacing it: new hunts that clear the bar get added, anything past the window drops off, and the whole set gets re-ranked. A quiet day adds nothing, and that's fine: the board still shows a full week of the best work.
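The merge step can be sketched in a few lines. The `id`, `score`, and `published` fields are hypothetical, and treating "the last seven days" as today plus the six days before it is an assumption about how the window is counted.

```python
from datetime import timedelta


def merge_board(board, new_hunts, today, window_days=7):
    """Merge today's hunts into the board, expire old ones, and re-rank.

    Each hunt is a dict with "id", "score", and "published" (a datetime.date).
    """
    cutoff = today - timedelta(days=window_days - 1)
    by_id = {h["id"]: h for h in board}
    for hunt in new_hunts:
        # Keep the first-published copy if the same hunt shows up again.
        by_id.setdefault(hunt["id"], hunt)
    fresh = [h for h in by_id.values() if h["published"] >= cutoff]
    return sorted(fresh, key=lambda h: h["score"], reverse=True)
```

Calling it with an empty `new_hunts` list is the quiet-day case: nothing is added, expired entries still drop off, and the rest keep their order.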

The board is public, so anything weak that slips through is sitting there for anyone to see, which is its own pressure to keep the gate tight. Every run also writes me a private brief with the full picture: what got added, every candidate the scouts considered, and what each scout rejected and why. The public side is what I'm comfortable putting my name on; the private side is where I go to see what the swarm passed over and decide whether it was right to.

Looking Back

I expected the hard part to be getting the agents to find good hunts, but that turned out to be the easy part. The model is a capable reader and three focused scouts surface plenty to work with. The hard part was pinning down what "good" means tightly enough that code could enforce it, and almost all of that came down to one rule: the hunt has to carry its own discriminator, because the analyst on the other end doesn't know the customer's baseline and never will. If you build something like this, that's where your time goes. The model will hand you a hunt that reads great and fires on every tenant that runs the tool, and the gate is the only thing that catches it.
