We Benchmarked 10 Agent Security Scanners Across 42 Repos
42 public repositories. 10 scanners. One judge model. We built the first independent benchmark for AI agent security tools.
TL;DR
- 42 repos, 10 scanners, 4,168 findings judged. If you are choosing an agent security scanner, this is the data to base that decision on.
- Firmis Deep: 66.9% precision. Its LLM layer cuts the static layer's 3,898 raw findings to 139, a reduction of about 96%, so your team reviews a short list of verified findings instead of thousands of unverified flags.
- Six scanners produced zero findings on agent-stack repos. They are built for different surfaces; if you run agent stacks, you need agent-aware tooling.
- Two competitors surfaced findings, but at lower precision: agent_audit at 23.5% and cisco_skill_scanner at 2.1%. More findings is not better if most of them are wrong.
Benchmarks in security are hard to do well. Every tool has a different threat model, a different target surface, and different claims about what it catches. We wanted to find out how AI agent security scanners actually perform on real-world agent-stack code, so we built a structured evaluation and ran it ourselves.
This is that benchmark: the methodology, the numbers, and our honest interpretation of what they mean.
Methodology
We selected 42 public GitHub repositories across eight categories designed to stress-test both detection breadth and false-positive behavior. Each scanner ran against the same corpus. All findings were then judged by a frontier AI model with full repository context and intent awareness, producing one of three verdicts: true positive (real threat), false positive (incorrect flag), or partial (ambiguous but not clearly false).
| Category | Repos | Examples |
|---|---|---|
| security_tools_fp_ground_truth | 6 | sqlmap, nuclei, ffuf, gitleaks, PentestGPT, trufflehog |
| deliberately_vulnerable | 3 | juice-shop, DVWA, NodeGoat |
| agent_harnesses | 5 | AutoGPT, Cline, Aider, Codex, OpenCode |
| agent_orchestration | 10 | LangChain, CrewAI, AutoGen, LlamaIndex, Swarm, ADK |
| mcp_ecosystem | 7 | MCP servers and integrations |
| skills_and_configs | 5 | Published agent skills and config bundles |
| workflow_automation | 3 | Langflow, Dify, Flowise |
| well_maintained_frameworks | 3 | Express, Flask, semantic-kernel |
The security_tools_fp_ground_truth category deserves special mention. These are offensive security tools by design. Flagging sqlmap or trufflehog as threats is a false positive. We included them specifically to measure how well each scanner avoids flagging legitimate security tooling.
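For readers who want to reproduce the corpus layout, the category table above can be written down as a small manifest. This is a minimal sketch, not the harness we ran: the names `CORPUS` and `Verdict` are illustrative assumptions, while the category keys and repo counts are copied from the table and the three verdicts from the methodology.

```python
from enum import Enum

# Repo counts per category, copied from the category table.
CORPUS = {
    "security_tools_fp_ground_truth": 6,
    "deliberately_vulnerable": 3,
    "agent_harnesses": 5,
    "agent_orchestration": 10,
    "mcp_ecosystem": 7,
    "skills_and_configs": 5,
    "workflow_automation": 3,
    "well_maintained_frameworks": 3,
}


class Verdict(Enum):
    """The three judge verdicts described in the methodology."""
    TRUE_POSITIVE = "true_positive"
    FALSE_POSITIVE = "false_positive"
    PARTIAL = "partial"


total_repos = sum(CORPUS.values())
print(len(CORPUS), "categories,", total_repos, "repos")  # 8 categories, 42 repos
```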
Results: All Scanners
Three of the nine scanners in the table below produced findings on this corpus. (Firmis Static, whose 3,898 raw findings bring the judged total to 4,168, is covered in the next section.) The other six did not. That reflects their different design scope rather than a product failure: tools built for container image scanning, dependency auditing, or DAST have different target surfaces. We included them to document the coverage gap, not to criticize the products.
| Scanner | Findings | TP | FP | Partial | Precision |
|---|---|---|---|---|---|
| Firmis Deep | 139 | 93 | 18 | 28 | 66.9% |
| agent_audit | 34 | 8 | 9 | 17 | 23.5% |
| cisco_skill_scanner | 97 | 2 | 89 | 6 | 2.1% |
| snyk_agent_scan | 0 | - | - | - | n/a |
| cisco_mcp_scanner | 0 | - | - | - | n/a |
| aguara | 0 | - | - | - | n/a |
| ferret_scan | 0 | - | - | - | n/a |
| mcp_shield | 0 | - | - | - | n/a |
| tencent | 0 | - | - | - | n/a |
Precision here is calculated as TP / (TP + FP + Partial), that is, true positives divided by all findings. Partial findings stay in the denominator but are not counted as true positives. This is intentionally conservative: a finding that is ambiguous is not credited as correct.
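The precision column can be checked directly against the counts in the table. The sketch below computes two candidate definitions side by side: TP over all findings, and TP over decided findings only (partials dropped). The helper names are our own for illustration; the counts are copied from the table. The first definition is the one that reproduces the published 66.9%, 23.5%, and 2.1%.

```python
# (findings, TP, FP, partial) copied from the results table.
RESULTS = {
    "Firmis Deep": (139, 93, 18, 28),
    "agent_audit": (34, 8, 9, 17),
    "cisco_skill_scanner": (97, 2, 89, 6),
}


def precision_all(tp: int, fp: int, partial: int) -> float:
    """TP over every judged finding; partials count against the scanner."""
    return tp / (tp + fp + partial)


def precision_decided(tp: int, fp: int) -> float:
    """TP over clear-cut verdicts only; partials are dropped."""
    return tp / (tp + fp)


for name, (findings, tp, fp, partial) in RESULTS.items():
    # Each row's verdict counts should add up to its finding count.
    assert findings == tp + fp + partial
    print(
        f"{name}: {precision_all(tp, fp, partial):.1%} of all findings, "
        f"{precision_decided(tp, fp):.1%} of decided findings"
    )
```

Keeping partials in the denominator is the stricter of the two choices, since every ambiguous finding lowers the score.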
The Two-Layer Architecture
Raw static analysis produces an enormous volume of findings. The question is what you do with it.
One of the most instructive numbers in this benchmark is not precision but volume reduction. Firmis Static, the rule-based layer running without AI verification, produced 3,898 findings across the same 42 repos. Firmis Deep, after the LLM layer reviewed and verified each finding with full repo context, surfaced 139 findings at 66.9% precision.
The static layer casts a wide net intentionally. The LLM layer then applies context: what does this repo do, is this pattern actually dangerous here, does the surrounding code confirm or contradict the signal? For your security team, this means reviewing 139 findings instead of 3,898. The signal-to-noise ratio makes each finding actionable rather than a line item to dismiss.
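The shape of a two-layer pipeline can be sketched in a few lines. This is an illustration of the general pattern only, not Firmis's implementation: the rules, the `Finding` type, and `stub_verifier` (a trivial stand-in for an LLM judge with repo context) are all hypothetical.

```python
import re
from typing import Callable, NamedTuple


class Finding(NamedTuple):
    rule: str
    line_no: int
    text: str


# Hypothetical static rules: deliberately broad, so they over-match.
RULES = {
    "hardcoded_secret": re.compile(r"(api_key|token)\s*=\s*['\"]"),
    "shell_exec": re.compile(r"subprocess\.|os\.system"),
}


def static_layer(source: str) -> list[Finding]:
    """Layer 1: flag every line that matches any rule."""
    return [
        Finding(rule, n, line.strip())
        for n, line in enumerate(source.splitlines(), start=1)
        for rule, pattern in RULES.items()
        if pattern.search(line)
    ]


def deep_layer(
    findings: list[Finding], verify: Callable[[Finding], bool]
) -> list[Finding]:
    """Layer 2: keep only the findings the verifier confirms."""
    return [f for f in findings if verify(f)]


def stub_verifier(finding: Finding) -> bool:
    """Stand-in for an LLM judge: dismiss obvious placeholder values."""
    return "example" not in finding.text.lower()


sample = '''api_key = "EXAMPLE-KEY"
token = "sk-live-123"
os.system(user_input)
'''
raw = static_layer(sample)
kept = deep_layer(raw, stub_verifier)
print(len(raw), "raw ->", len(kept), "kept")  # 3 raw -> 2 kept

# Volume reduction for the figures in this section: 3,898 static -> 139 deep.
print(f"{1 - 139 / 3898:.1%} fewer findings to review")
```

Passing the verifier in as a callable keeps the static rules testable on their own and lets the second layer be swapped without touching the first.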
Firmis Deep by Category
Precision was not uniform across categories, and the variation is meaningful.
| Category | Precision | Notes |
|---|---|---|
| agent_harnesses | 91.3% | 21 TP out of 23 findings. Strongest category. |
| deliberately_vulnerable | 85.7% | Known-bad repos. Confirms detection coverage. |
| workflow_automation | 81.8% | Langflow, Dify, Flowise. High signal density. |
| mcp_ecosystem | 74.1% | MCP servers and integrations. Good coverage. |
| security_tools (FP ground truth) | 12.5% | Expected. These are offensive tools by design. |
The 12.5% precision on security tools is expected, not a detection failure. sqlmap and trufflehog are built to look dangerous, so flags on them count as false positives by design. A scanner that flags nothing at all in these repos may also be missing real threats in other categories. The LLM layer suppressed most of these findings, but some ambiguous patterns made it through. We consider that an acceptable tradeoff.
What This Tells Us About the Category
The most striking finding is not Firmis's numbers. It is the coverage gap. Most of the tooling that organizations reach for when someone says "scan our security posture" was not designed for agent stacks and produced no signal here. That is not a criticism. It is a description of where the category is right now.
- Agent-stack code has a distinct threat surface: tool permissions, config injection points, credential exposure through MCP topology, skill provenance.
- Generic SAST and dependency scanners are looking for different patterns in different file types.
- The scanners that did produce findings varied by more than 30x in precision, which means raw finding count is a poor proxy for usefulness.
- Volume reduction matters as much as detection rate. 3,898 raw findings is not a usable security program.
Practical Takeaways
- If your scanner produces zero findings on a live agent stack, verify it was designed for agent-stack surfaces before assuming you are clean.
- Precision matters more than recall for developer-facing security tools. High noise kills adoption.
- A two-layer approach (static rules plus LLM verification) significantly outperforms raw static analysis on precision without sacrificing most true positives.
- The deliberately_vulnerable category (85.7% precision) confirms the detection engine is catching real patterns, not just matching on keywords.
The goal of a security benchmark is not to declare a winner. It is to understand what each tool is actually doing and where the gaps are.
Run It Yourself
The scanner behind Firmis Deep is available now. Zero install, any agent platform.
The free tier runs unlimited rule-based scans. Deep scan uses AI credits, with 50 included each month at no cost. No account required to start.
References & Sources
- [1] OWASP Juice Shop: Deliberately vulnerable web app (TP ground truth)
- [2] DVWA (Damn Vulnerable Web Application): Deliberately vulnerable PHP app (TP ground truth)
- [3] sqlmap: Offensive security tool (FP ground truth)
- [4] LangChain: Agent orchestration framework (779 static findings, 10 deep, 7 TP)
- [5] CrewAI: Multi-agent orchestration (benchmark corpus)
- [6] Cline: AI coding agent (244 static findings, 7 deep, 6 TP)
- [7] Google ADK: Agent Development Kit (benchmark corpus)
- [8] Firmis Scanner (open source): Apache-2.0, two-layer static + deep architecture