Research · March 24, 2026 · 8 min read

We Benchmarked 10 Agent Security Scanners Across 42 Repos

42 public repositories. 10 scanners. One judge model. We built the first independent benchmark for AI agent security tools.

TL;DR

  • 42 repos, 10 scanners, 4,168 findings judged. If you are choosing an agent security scanner, this is the data to base that decision on.
  • Firmis Deep: 66.9% precision. Its LLM layer cuts finding volume by 96.3%, so your team reviews real threats instead of thousands of false positives.
  • Six scanners produced zero findings on agent-stack repos. They are built for different surfaces; if you run agent stacks, you need agent-aware tooling.
  • Two competitors surfaced findings but with lower precision. More findings is not better if most of them are wrong.

Benchmarks in security are hard to do well. Every tool has a different threat model, a different target surface, and different claims about what it catches. We wanted to find out how AI agent security scanners actually perform on real-world agent-stack code, so we built a structured evaluation and ran it ourselves.

This is that benchmark. The methodology, the numbers, and our honest interpretation of what they mean.

Methodology

  • 42 repos evaluated
  • 10 scanners tested
  • 4,168 AI judgments
  • 8 repo categories

We selected 42 public GitHub repositories across eight categories designed to stress-test both detection breadth and false-positive behavior. Each scanner ran against the same corpus. All findings were then judged by a frontier AI model with full repository context and intent awareness, producing one of three verdicts: true positive (real threat), false positive (incorrect flag), or partial (ambiguous but not clearly false).

| Category | Repos | Examples |
|---|---|---|
| security_tools_fp_ground_truth | 6 | sqlmap, nuclei, ffuf, gitleaks, PentestGPT, trufflehog |
| deliberately_vulnerable | 3 | juice-shop, DVWA, NodeGoat |
| agent_harnesses | 5 | AutoGPT, Cline, Aider, Codex, OpenCode |
| agent_orchestration | 10 | LangChain, CrewAI, AutoGen, LlamaIndex, Swarm, ADK |
| mcp_ecosystem | 7 | MCP servers and integrations |
| skills_and_configs | 5 | Published agent skills and config bundles |
| workflow_automation | 3 | Langflow, Dify, Flowise |
| well_maintained_frameworks | 3 | Express, Flask, semantic-kernel |

The security_tools_fp_ground_truth category deserves special mention. These are offensive security tools by design. Flagging sqlmap or trufflehog as threats is a false positive. We included them specifically to measure how well each scanner avoids flagging legitimate security tooling.
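Each judged finding carries one of the three verdicts described in the methodology. As a minimal sketch of how such judgments could be recorded and tallied per scanner, the following uses hypothetical names (`Verdict`, `Judgment`, `tally`) and made-up sample data; it is an illustration, not the benchmark's actual harness.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    TRUE_POSITIVE = "tp"    # real threat
    FALSE_POSITIVE = "fp"   # incorrect flag
    PARTIAL = "partial"     # ambiguous but not clearly false


@dataclass(frozen=True)
class Judgment:
    scanner: str
    repo: str
    verdict: Verdict


def tally(judgments):
    """Count verdicts per scanner, returned as {scanner: Counter}."""
    counts = {}
    for j in judgments:
        counts.setdefault(j.scanner, Counter())[j.verdict] += 1
    return counts


# Illustrative data only; scanner and repo names are placeholders.
sample = [
    Judgment("scanner_a", "repo_1", Verdict.TRUE_POSITIVE),
    Judgment("scanner_a", "repo_1", Verdict.PARTIAL),
    Judgment("scanner_a", "repo_2", Verdict.FALSE_POSITIVE),
    Judgment("scanner_b", "repo_2", Verdict.FALSE_POSITIVE),
]
counts = tally(sample)
```

Keeping the verdict as a closed enum makes the three-way split explicit, so a partial judgment cannot silently be folded into either of the other two buckets.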

Results: All Scanners

Of the nine scanners in the table below, three produced findings on this corpus and six did not. The tenth, Firmis Static, is the rule-only layer covered in the next section; its 3,898 unfiltered findings account for the rest of the 4,168 judgments. The six zero-finding results reflect different design scope rather than product failure: tools built for container image scanning, dependency auditing, or DAST target different surfaces. We included them to document the coverage gap, not to criticize the products.

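| Scanner | Findings | TP | FP | Partial | Precision |
|---|---|---|---|---|---|
| Firmis Deep | 139 | 93 | 18 | 28 | 66.9% |
| agent_audit | 34 | 8 | 9 | 17 | 23.5% |
| cisco_skill_scanner | 97 | 2 | 89 | 6 | 2.1% |
| snyk_agent_scan | 0 | - | - | - | n/a |
| cisco_mcp_scanner | 0 | - | - | - | n/a |
| aguara | 0 | - | - | - | n/a |
| ferret_scan | 0 | - | - | - | n/a |
| mcp_shield | 0 | - | - | - | n/a |
| tencent | 0 | - | - | - | n/a |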

Precision here is calculated as TP / (TP + FP + Partial), that is, true positives over all judged findings. Partial findings stay in the denominator but are not counted in the numerator. This is intentionally conservative: a finding that is ambiguous is not credited as correct.
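The percentages in the table can be reproduced by dividing TP by the total finding count, so Partial verdicts sit in the denominator without being credited. A short sketch using the three non-zero rows (the function name `precision` and the `None` return for zero findings are our own illustrative choices):

```python
def precision(tp, fp, partial):
    """TP over all judged findings; None when there is nothing to judge."""
    total = tp + fp + partial
    if total == 0:
        return None  # shown as n/a in the results table
    return tp / total


# (TP, FP, Partial) taken from the results table.
rows = {
    "Firmis Deep": (93, 18, 28),
    "agent_audit": (8, 9, 17),
    "cisco_skill_scanner": (2, 89, 6),
}

for name, (tp, fp, partial) in rows.items():
    print(f"{name}: {precision(tp, fp, partial):.1%}")
# Firmis Deep: 66.9%
# agent_audit: 23.5%
# cisco_skill_scanner: 2.1%
```

Dropping the Partial column from the denominator would give noticeably higher figures (for example 93 / 111, about 83.8%, for Firmis Deep), which is why the choice of denominator matters when comparing against other benchmarks.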

The Two-Layer Architecture

Raw static analysis produces an enormous volume of findings. The question is what you do with them.

One of the most instructive numbers in this benchmark is not precision but volume reduction. Firmis Static, the rule-based layer running without AI verification, produced 3,898 findings across the same 42 repos. Firmis Deep, after the LLM layer reviewed and verified each finding with full repo context, surfaced 139 findings at 66.9% precision.

  • Firmis Static: 3,898 findings, 2.6% precision
  • Volume reduced: 96.3%, with 95% of TPs preserved
  • Firmis Deep: 139 findings, 66.9% precision

The static layer casts a wide net intentionally. The LLM layer then applies context: what does this repo do, is this pattern actually dangerous here, does the surrounding code confirm or contradict the signal? For your security team, this means reviewing 139 findings instead of 3,898. The signal-to-noise ratio makes each finding actionable rather than a line item to dismiss.
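To make the two-layer idea concrete, here is a minimal sketch of a wide-net rule pass followed by a context-aware filter. Everything in it is an assumption for illustration: the `RULES` patterns, the `Finding` type, and especially `stub_verifier`, a trivial heuristic standing in for the LLM call. It does not reflect how Firmis implements either layer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    rule: str
    line: str


# Hypothetical wide-net rules; a real rule set would be far larger.
RULES = {
    "shell-exec": re.compile(r"subprocess|os\.system"),
    "hardcoded-secret": re.compile(r"(?i)api[_-]?key\s*=\s*['\"]"),
}


def static_layer(files):
    """Layer 1: flag every line matching any rule, with no repo context."""
    findings = []
    for path, text in files.items():
        for line in text.splitlines():
            for rule, pattern in RULES.items():
                if pattern.search(line):
                    findings.append(Finding(path, rule, line.strip()))
    return findings


def deep_layer(findings, files, verifier):
    """Layer 2: keep only findings the verifier confirms given the repo."""
    return [f for f in findings if verifier(f, files)]


def stub_verifier(finding, files):
    # Placeholder for the model call: treat test fixtures as non-threats.
    return not finding.path.startswith("tests/")


# Toy repo contents, invented for this example.
files = {
    "agent/tools.py": "import subprocess\nsubprocess.run(cmd, shell=True)\n",
    "tests/fixtures.py": 'API_KEY = "dummy"\n',
}

raw = static_layer(files)
kept = deep_layer(raw, files, stub_verifier)
print(len(raw), "raw findings ->", len(kept), "after verification")
```

The design point is the separation of concerns: the first pass can afford to over-flag because every finding is re-examined with access to the whole repository before anyone is asked to review it.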

Firmis Deep by Category

Precision was not uniform across categories, and the variation is meaningful.

| Category | Precision | Notes |
|---|---|---|
| agent_harnesses | 91.3% | 21 TP out of 23 findings. Strongest category. |
| deliberately_vulnerable | 85.7% | Known-bad repos. Confirms detection coverage. |
| workflow_automation | 81.8% | Langflow, Dify, Flowise. High signal density. |
| mcp_ecosystem | 74.1% | MCP servers and integrations. Good coverage. |
| security_tools (FP ground truth) | 12.5% | Expected. These are offensive tools by design. |

The 12.5% precision on security tools is expected, not a failure. sqlmap and trufflehog are built to look dangerous, so most flags raised against them count as false positives by design of the category. A scanner that does not flag them at all has likely missed real threats in other categories too. The LLM layer suppressed most of these flags, but some ambiguous patterns made it through. We consider that an acceptable tradeoff.

Benchmark Limitations

This benchmark reflects one corpus at one point in time. Scanner rules update continuously. The 42 repos were chosen to cover representative agent-stack patterns, not to be exhaustive. Precision numbers should be interpreted as directional signals, not as stable product ratings. We plan to re-run quarterly.

What This Tells Us About the Category

The most striking finding is not Firmis's numbers. It is the coverage gap. Most of the tooling that organizations reach for when someone says "scan our security posture" was not designed for agent stacks and produced no signal here. That is not a criticism. It is a description of where the category is right now.

  • Agent-stack code has a distinct threat surface: tool permissions, config injection points, credential exposure through MCP topology, skill provenance.
  • Generic SAST and dependency scanners are looking for different patterns in different file types.
  • The scanners that did produce findings varied by more than 30x in precision, which means raw finding count is a poor proxy for usefulness.
  • Volume reduction matters as much as detection rate. 3,898 raw findings is not a usable security program.
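The precision spread and the weakness of raw finding count as a proxy can both be read off the results table. A small sketch using the three scanners that produced findings (variable names are our own):

```python
# Findings and true positives from the results table.
results = {
    "Firmis Deep": {"findings": 139, "tp": 93},
    "agent_audit": {"findings": 34, "tp": 8},
    "cisco_skill_scanner": {"findings": 97, "tp": 2},
}

precision = {name: r["tp"] / r["findings"] for name, r in results.items()}

# Ratio between the most and least precise scanner.
spread = max(precision.values()) / min(precision.values())

# Rank the same scanners two ways: by volume and by precision.
by_count = sorted(results, key=lambda n: results[n]["findings"], reverse=True)
by_precision = sorted(precision, key=precision.get, reverse=True)

print(f"spread: {spread:.1f}x")
print("by finding count:", by_count)
print("by precision:    ", by_precision)
```

The two orderings disagree: cisco_skill_scanner ranks second by finding count but last by precision, which is the sense in which volume alone says little about usefulness.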

Practical Takeaways

  • If your scanner produces zero findings on a live agent stack, verify it was designed for agent-stack surfaces before assuming you are clean.
  • Precision matters more than recall for developer-facing security tools. High noise kills adoption.
  • A two-layer approach (static rules plus LLM verification) raised precision from 2.6% to 66.9% on this corpus while preserving 95% of true positives.
  • The deliberately_vulnerable category (85.7% precision) suggests the detection engine is catching real patterns, not just matching on keywords.

The goal of a security benchmark is not to declare a winner. It is to understand what each tool is actually doing and where the gaps are.

Run It Yourself

The scanner behind Firmis Deep is available now. Zero install, any agent platform.

$ npx firmis-cli init

The free tier runs unlimited rule-based scans. Deep scan uses AI credits, with 50 included each month at no cost. No account required to start.

References & Sources

  1. OWASP Juice Shop - Deliberately vulnerable web app (TP ground truth)
  2. DVWA (Damn Vulnerable Web Application) - Deliberately vulnerable PHP app (TP ground truth)
  3. sqlmap - Offensive security tool (FP ground truth)
  4. LangChain - Agent orchestration framework (779 static findings, 10 deep, 7 TP)
  5. CrewAI - Multi-agent orchestration (benchmark corpus)
  6. Cline - AI coding agent (244 static findings, 7 deep, 6 TP)
  7. Google ADK - Agent Development Kit (benchmark corpus)
  8. Firmis Scanner (open source) - Apache-2.0, two-layer static + deep architecture
