How did you judge findings in the benchmark?

All findings from all scanners were submitted to a frontier AI model with full repository context and intent awareness. Each finding received one of three verdicts: true positive (genuine threat), false positive (incorrect flag), or partial (ambiguous). This produced 4,168 unique judgments across the 42-repo corpus. Precision is calculated as TP / (TP + FP + Partial): partial findings count toward the denominator but are never credited as correct.

Why did six scanners produce zero findings?

Six scanners in the benchmark did not produce findings on the agent-stack repos we tested. This reflects different design scope rather than a product failure. Tools built for container scanning, dependency auditing, or web application DAST have different target surfaces than agent harnesses, MCP servers, and skill configs. We included them to document the coverage gap, not to evaluate their performance on surfaces they were not designed for.

What is the difference between Firmis Static and Firmis Deep?

Firmis Static is the rule-based layer: 300+ detection rules that scan any agent platform without AI involvement. It produced 3,898 findings across the 42 repos at 2.6% precision. Firmis Deep adds an LLM verification layer that reviews each finding with full repository context, reducing volume by 96.3% while preserving 95% of true positives and reaching 66.9% precision. Static scanning is unlimited and free. Deep scan uses AI credits (50 included monthly at no cost).


Research · March 24, 2026 · 8 min read

We Benchmarked 10 Agent Security Scanners Across 42 Repos

42 public repositories. 10 scanners. One judge model. We built the first independent benchmark for AI agent security tools.

TL;DR

42 repos, 10 scanners, 4,168 findings judged. If you are choosing an agent security scanner, this is the data to base that decision on.
Firmis Deep: 66.9% precision. Its LLM layer cuts finding volume by 96.3%, so your team reviews real threats instead of thousands of false positives.
Six scanners produced zero findings on agent-stack repos. They are built for different surfaces; if you run agent stacks, you need agent-aware tooling.
Two competitors surfaced findings, but at lower precision (23.5% and 2.1%). More findings are not better if most of them are wrong.

Benchmarks in security are hard to do well. Every tool has a different threat model, a different target surface, and different claims about what it catches. We wanted to find out how AI agent security scanners actually perform on real-world agent-stack code, so we built a structured evaluation and ran it ourselves.

This is that benchmark. The methodology, the numbers, and our honest interpretation of what they mean.

Methodology

42 Repos Evaluated · 10 Scanners Tested · 4,168 AI Judgments · 8 Repo Categories

We selected 42 public GitHub repositories across eight categories designed to stress-test both detection breadth and false-positive behavior. Each scanner ran against the same corpus. All findings were then judged by a frontier AI model with full repository context and intent awareness, producing one of three verdicts: true positive (real threat), false positive (incorrect flag), or partial (ambiguous but not clearly false).

Category	Repos	Examples
security_tools_fp_ground_truth	6	sqlmap, nuclei, ffuf, gitleaks, PentestGPT, trufflehog
deliberately_vulnerable	3	juice-shop, DVWA, NodeGoat
agent_harnesses	5	AutoGPT, Cline, Aider, Codex, OpenCode
agent_orchestration	10	LangChain, CrewAI, AutoGen, LlamaIndex, Swarm, ADK
mcp_ecosystem	7	MCP servers and integrations
skills_and_configs	5	Published agent skills and config bundles
workflow_automation	3	Langflow, Dify, Flowise
well_maintained_frameworks	3	Express, Flask, semantic-kernel

The security_tools_fp_ground_truth category deserves special mention. These are offensive security tools by design. Flagging sqlmap or trufflehog as threats is a false positive. We included them specifically to measure how well each scanner avoids flagging legitimate security tooling.
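For anyone reproducing the corpus, the category split can be captured as a small manifest. The sketch below is illustrative only: the names `CORPUS` and `GROUND_TRUTH` are assumptions made for this example, while the repo counts are copied from the category table.

```python
# Repo counts per category, copied from the corpus table in this article.
CORPUS = {
    "security_tools_fp_ground_truth": 6,
    "deliberately_vulnerable": 3,
    "agent_harnesses": 5,
    "agent_orchestration": 10,
    "mcp_ecosystem": 7,
    "skills_and_configs": 5,
    "workflow_automation": 3,
    "well_maintained_frameworks": 3,
}

# Hypothetical tagging of the two categories that act as ground truth:
# "fp" = flags here are expected to be false positives,
# "tp" = repos are known-bad, so flags are expected to be real.
GROUND_TRUTH = {
    "security_tools_fp_ground_truth": "fp",
    "deliberately_vulnerable": "tp",
}

total_repos = sum(CORPUS.values())
print(f"{len(CORPUS)} categories, {total_repos} repos")
```

Keeping the ground-truth tags next to the counts makes it easy to report precision separately for the categories where the expected verdict is known in advance.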

Results: All Scanners

Four of the ten scanners produced findings on this corpus: Firmis Static (3,898 raw findings, covered in the next section), Firmis Deep, and two competitors, which together account for the 4,168 judged findings. The other six produced none on the agent-stack repos we tested. This reflects their different design scope rather than a product failure. Tools built for container image scanning, dependency auditing, or DAST have different target surfaces. We included them to document the coverage gap, not to criticize the products.

Scanner	Findings	TP	FP	Partial	Precision
Firmis Deep	139	93	18	28	66.9%
agent_audit	34	8	9	17	23.5%
cisco_skill_scanner	97	2	89	6	2.1%
snyk_agent_scan	0	-	-	-	n/a
cisco_mcp_scanner	0	-	-	-	n/a
aguara	0	-	-	-	n/a
ferret_scan	0	-	-	-	n/a
mcp_shield	0	-	-	-	n/a
tencent	0	-	-	-	n/a

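Precision here is calculated as TP / (TP + FP + Partial), that is, true positives divided by all findings. Partial findings stay in the denominator but are never counted as correct: for Firmis Deep, 93 / 139 = 66.9%. This is intentionally conservative: a finding that is ambiguous is not credited as correct.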
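The precision column can be reproduced from the table's own counts. The minimal Python sketch below assumes partial verdicts stay in the denominator, since dividing true positives by total findings is what matches the published percentages (93 / 139 = 66.9%). The function name and row layout are illustrative and not part of any scanner's API.

```python
def precision(tp: int, fp: int, partial: int) -> float:
    """True positives divided by all judged findings.

    Partial verdicts count toward the denominator but not the numerator,
    so an ambiguous finding is never credited as correct.
    """
    total = tp + fp + partial
    if total == 0:
        # Scanners with zero findings have no defined precision ("n/a").
        raise ValueError("precision is undefined for a scanner with no findings")
    return tp / total


# (TP, FP, Partial) counts copied from the results table.
ROWS = {
    "Firmis Deep": (93, 18, 28),
    "agent_audit": (8, 9, 17),
    "cisco_skill_scanner": (2, 89, 6),
}

for name, counts in ROWS.items():
    print(f"{name}: {precision(*counts):.1%}")
```

Raising an error for empty input mirrors the table's "n/a" entries: a scanner with no findings has no precision to report, which is different from a precision of zero.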

The Two-Layer Architecture

Raw static analysis produces an enormous volume of findings. The question is what you do with them.

One of the most instructive numbers in this benchmark is not precision but volume reduction. Firmis Static, the rule-based layer running without AI verification, produced 3,898 findings across the same 42 repos. Firmis Deep, after the LLM layer reviewed and verified each finding with full repo context, surfaced 139 findings at 66.9% precision.

Firmis Static: 3,898 findings, 2.6% precision
Volume reduced: 96.3%, with 95% of TPs preserved
Firmis Deep: 139 findings, 66.9% precision

The static layer casts a wide net intentionally. The LLM layer then applies context: what does this repo do, is this pattern actually dangerous here, does the surrounding code confirm or contradict the signal? For your security team, this means reviewing 139 findings instead of 3,898. The signal-to-noise ratio makes each finding actionable rather than a line item to dismiss.
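The two-layer idea can be sketched in a few lines of Python. This is not the Firmis implementation: the two regex rules, the `Finding` type, and `toy_verifier` (a stand-in for the LLM review step) are all hypothetical and exist only to show the shape of "over-inclusive rules first, context-aware filtering second".

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    rule_id: str
    path: str
    line: int
    snippet: str


# Hypothetical rules for illustration only; these are not Firmis rules.
RULES = {
    "hardcoded-secret": re.compile(r"(api_key|secret)\s*=\s*['\"][^'\"]+['\"]", re.I),
    "shell-exec": re.compile(r"subprocess\.(run|Popen)\(.*shell\s*=\s*True"),
}

# A verifier sees one finding plus the whole repo and returns True to keep it.
Verifier = Callable[[Finding, dict[str, str]], bool]


def static_scan(files: dict[str, str]) -> list[Finding]:
    """Layer 1: cheap pattern matching, deliberately over-inclusive."""
    findings = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for rule_id, pattern in RULES.items():
                if pattern.search(line):
                    findings.append(Finding(rule_id, path, lineno, line.strip()))
    return findings


def deep_scan(files: dict[str, str], verifier: Verifier) -> list[Finding]:
    """Layer 2: keep only the findings the verifier confirms in repo context."""
    return [f for f in static_scan(files) if verifier(f, files)]


def toy_verifier(finding: Finding, files: dict[str, str]) -> bool:
    """Stand-in for an LLM call: drops findings located in test fixtures."""
    return not finding.path.startswith("tests/")


repo = {
    "app/config.py": 'API_KEY = "sk-live-123"\n',
    "tests/fixtures.py": 'secret = "dummy"\n',
}
raw = static_scan(repo)
verified = deep_scan(repo, toy_verifier)
print(f"static: {len(raw)} findings, deep: {len(verified)} findings")
```

Passing the verifier in as a function keeps the static layer usable on its own and lets the second layer be swapped out, for example replacing the toy path check with a model call that receives the surrounding code.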

Firmis Deep by Category

Precision was not uniform across categories, and the variation is meaningful.

Category	Precision	Notes
agent_harnesses	91.3%	21 TP out of 23 findings. Strongest category.
deliberately_vulnerable	85.7%	Known-bad repos. Confirms detection coverage.
workflow_automation	81.8%	Langflow, Dify, Flowise. High signal density.
mcp_ecosystem	74.1%	MCP servers and integrations. Good coverage.
security_tools (FP ground truth)	12.5%	Expected. These are offensive tools by design.

The 12.5% precision on security tools is correct behavior, not a failure. sqlmap and trufflehog are built to look dangerous. A scanner that does not flag them at all has likely missed real threats in other categories too. The LLM layer correctly suppressed most of these, but some ambiguous patterns made it through. We consider that an acceptable tradeoff.

Benchmark Limitations

This benchmark reflects one corpus at one point in time. Scanner rules update continuously. The 42 repos were chosen to cover representative agent-stack patterns, not to be exhaustive. Precision numbers should be interpreted as directional signals, not as stable product ratings. We plan to re-run quarterly.

What This Tells Us About the Category

The most striking finding is not Firmis's numbers. It is the coverage gap. Most of the tooling that organizations reach for when someone says "scan our security posture" was not designed for agent stacks and produced no signal here. That is not a criticism. It is a description of where the category is right now.

Agent-stack code has a distinct threat surface: tool permissions, config injection points, credential exposure through MCP topology, skill provenance.
Generic SAST and dependency scanners are looking for different patterns in different file types.
The scanners that did produce findings varied by more than 30x in precision, which means raw finding count is a poor proxy for usefulness.
Volume reduction matters as much as detection rate. 3,898 raw findings is not a usable security program.

Practical Takeaways

→ If your scanner produces zero findings on a live agent stack, verify it was designed for agent-stack surfaces before assuming you are clean.
→ Precision matters more than recall for developer-facing security tools. High noise kills adoption.
→ A two-layer approach (static rules plus LLM verification) significantly outperforms raw static analysis on precision without sacrificing most true positives.
→ The deliberately_vulnerable category (85.7% precision) confirms the detection engine is catching real patterns, not just matching on keywords.

The goal of a security benchmark is not to declare a winner. It is to understand what each tool is actually doing and where the gaps are.

Run It Yourself

The scanner behind Firmis Deep is available now. Zero install, any agent platform.

$ npx firmis-cli init

The free tier runs unlimited rule-based scans. Deep scan uses AI credits, with 50 included each month at no cost. No account required to start.

References & Sources

[1] OWASP Juice Shop: Deliberately vulnerable web app (TP ground truth)
[2] DVWA (Damn Vulnerable Web Application): Deliberately vulnerable PHP app (TP ground truth)
[3] sqlmap: Offensive security tool (FP ground truth)
[4] LangChain: Agent orchestration framework (779 static findings, 10 deep, 7 TP)
[5] CrewAI: Multi-agent orchestration (benchmark corpus)
[6] Cline: AI coding agent (244 static findings, 7 deep, 6 TP)
[7] Google ADK: Agent Development Kit (benchmark corpus)
[8] Firmis Scanner (open source): Apache-2.0, two-layer static + deep architecture

