New Research: 6,943 AI agent skills have security flaws. We scanned all 40,059. Read the report →
Back to Journal
ResearchApril 7, 2026·8 min read

84% Detection on Novel Attacks: Inside Our Monitor Eval

We evaluated Firmis Monitor against 278 CVE-derived attack scenarios and 49 published CVE reproductions. Here is every number, including the gaps.

TL;DR

  • 84.17% of CVE-derived attack scenarios blocked before they could reach your credentials, files, or network. No prior exposure to any test case.
  • 100% detection across six critical categories: backdoor, exfil, identity, lateral movement, persistence, and RCE. These cover the attacks that cause real breaches.
  • Known gaps published transparently: unicode tag injection, wget chains, port exposure. You can evaluate whether these are in your threat model.
  • Blast radius scoring catches model mistakes that rules miss. 0% false positives on real coding sessions, so your workflow stays uninterrupted.

Evaluation methodology matters more than the headline number. A detector trained and tested on the same dataset can report 99% accuracy while failing completely on the first novel attack it encounters in production. We built three independent test sets to avoid that failure mode, and this post explains every number from each one.

Three-Tier Evaluation Design

We used three distinct holdout sets, each measuring something different. The overall 84% figure comes from the set that matters most: cases the system never saw during rule development.

SetNIndependence
B4v2 Regression280 attacks + 99 benignNot independent. Prior training data.
Pentest Novel (primary)278 attack casesFully independent. Developed from published CVEs, never seen during rule development.
Rehberger CVE (adversarial)49 cases, 13 productsMaximally adversarial. Published CVE research across AI coding products.
Why B4v2 numbers are not the primary claim
Regression sets confirm you have not broken existing rules. They do not measure generalization. A detector with 100% on B4v2 and 0% on novel cases is worse than useless: it has overfit to known patterns while leaving real threats undetected. All primary claims in this post use pentest-novel.

Primary Results: Pentest Novel (N=278)

84.17%
Overall Detection Rate
234/278
Cases Detected
6
Categories at 100%
278
Novel Cases

All 278 cases were developed from published CVEs across real AI agent products. No case was seen during rule development. For teams evaluating runtime protection, this is the number that matters: 84% of real-world attack patterns are blocked before they reach your credentials, files, or network. The remaining 16% are documented below so you can assess whether they fall within your threat model.

CategoryDetectionNotes
Backdoor100%All cases detected
Exfil100%All cases detected
Identity100%All cases detected
Lateral movement100%All cases detected
Persistence100%All cases detected
RCE100%All cases detected
Git manipulation83.3%One edge case missed
Config poisoning69.0%Novel config formats
Supply chain19.4%Handled by separate package checker layer
Injection0%Requires response inspection layer, not yet deployed

The 19.4% on supply chain is not a monitor failure. Supply chain attacks are covered by a separate package integrity checker that runs at install time, not at runtime. The monitor eval tested runtime detection only. The 0% on injection is a genuine gap: injection attacks require inspecting model output responses, which the current monitor layer does not do.

Adversarial Results: Rehberger CVE Set (N=49)

The Rehberger set is the hardest test we ran. These 49 cases come from published CVE research across 13 AI coding products. The attacks are maximally adversarial: designed by a skilled researcher specifically to evade detection in real production systems.

79.6%
Rehberger Detection Rate
39/49
Cases Detected
8
CVE Types at 100%
5
Known Gap Types
CVE CategoryDetection
code-injection-rce100%
cross-agent-config-poison100%
html-img-exfil100%
markdown-image-exfil100%
mcp-rug-pull100%
path-traversal100%
spaiware-memory-inject100%
zombai-clickfix100%
unicode-tag-injection0/4
zombai-download-execute0/3
token-exfil0/2
port-exposure0/3
self-config-rce0/2

What the gaps tell us

  • unicode-tag-injection hides payloads in Unicode tag characters. Detection requires Unicode normalization in the rule engine, not yet implemented.
  • zombai-download-execute chains wget with shell execution across two separate commands. The rule engine evaluates commands individually, not as sequences.
  • token-exfil and port-exposure require network-layer inspection beyond filesystem and process monitoring.
  • self-config-rce targets the monitor configuration itself. A known bootstrapping problem.
  • All five are on the roadmap. Documenting them here is deliberate.

Transparency about gaps is not a concession. It is the only way to give users an accurate threat model.

Systematic Scorer: Holdout Results (N=230)

Beyond rule-based detection, Firmis Monitor includes a systematic scorer that computes a risk score across multiple behavioral signals. The scorer was evaluated on a separate 230-case holdout set.

0.9535
F1 Score
at threshold 0.15
98.8%
Recall
attacks caught
92.7%
Precision
true positive rate
0%
FP Rate
on real coding tasks

The 0% false positive rate on real coding tasks (intercode_benign dataset) is the number we are most careful about. A runtime monitor that fires on legitimate work is worse than no monitor: teams disable it within days. We tested against actual developer workflows and found zero false positives. That means your developers can keep Firmis Monitor running continuously without workflow interruption.

The sanity gate passed: holdout F1 (0.9535) exceeds full-set F1 by +0.0087. No overfitting detected. We flagged 2 zero-contribution rules for suppression; they matched nothing in either the training or holdout sets and are being removed in the next release.

Blast Radius: The Unique Signal

Every detected event is assigned an impact tier based on what the agent action could actually reach: files, credentials, network endpoints, other processes. Two actions with identical rule triggers can have radically different real-world consequences. Blast radius quantifies that difference.

Session TypeELEVATED+ Blast Radius
Agentic attacks (adversarial)80%
Model mistakes (benign intent, dangerous action)35%
Adversarial evasion (individual commands pass rules)25%
Benign coding sessions0%

The 35% on model mistakes is particularly important. When an agent performs a dangerous action from a benign intent, for example overwriting a config file because a user asked it to clean up the project, the intent is not malicious but the blast radius is still elevated. Policy rules based on user intent cannot catch this. Blast radius scoring can.

The 25% on adversarial evasion is also notable. Some attacks are designed to keep each individual command below rule thresholds. The blast radius scorer evaluates aggregate effect across a session, catching what per-command rules miss.

Feature Discrimination: What the Scorer Learned

Three features drove the strongest separation between attacks and benign sessions. Cohen's d values are from the holdout set:

1
outputEntropyd=1.716

Shannon entropy of command output. Attacks produce distinctive entropy patterns: very low (structured credential data) or very high (encrypted exfil payloads).

2
blastRadiusScored=1.659

Composite score of potential impact scope. Attacks reaching outside the project directory or touching credential files score significantly higher.

3
targetOutsideProjectd=1.187

Boolean flag for operations targeting paths outside the working directory. Simple signal, strong discriminator.

What This Means for Production Use

  • 84% detection on novel attacks with 0% false positives on real coding sessions. This is a usable signal, not a lab number.
  • The gaps are documented and bounded. If unicode steganography or multi-step wget chains are in your threat model, track the roadmap for those specific rule additions.
  • Blast radius scoring adds a layer that static rules cannot provide: it catches model mistakes and resource abuse independent of attack intent.
  • The scorer complements rules, it does not replace them. Both layers are active in production.
  • 79.6% on the Rehberger adversarial CVE set reflects real-world capability against published, expert-crafted attacks.

Run Firmis Monitor on your agent stack

  • Install: npx firmis-cli monitor (watches your coding agent sessions in real time)
  • Review the gap list above. If unicode-tag-injection or wget chains are in your threat model, track the roadmap for those rule additions.
  • The free tier includes passive monitoring. Active blocking and blast radius dashboards are Pro tier.

We will update this post as gap coverage improves. Every claim above will be re-evaluated against the same holdout sets so the numbers remain comparable across versions.

References & Sources

  1. [1]
    Rehberger CVE research across AI coding products- 49 cases, 13 products, maximally adversarial test set
  2. [2]
    InterCode benchmark (benign coding tasks)- Used for false positive validation (0% FP rate)
  3. [3]
    CVE-2025-59536: Claude Code hooks injection- One of 49 CVEs in the Rehberger adversarial set
  4. [4]
    Unicode tag steganography in prompt injection- Known gap: payloads hidden in Unicode tag characters (U+E0001-U+E007F)
  5. [5]
    MITRE ATT&CK for AI Systems- Attack category taxonomy used for classification
  6. [6]
    Firmis Monitor- Runtime protection with policy rules + systematic scorer + blast radius

Try It Now

Find out if your agent stack is safe

One command. 30 seconds. Free.

$npx firmis-cli init

Fix and Monitor included with Pro

View pricing