Benchmark Methodology
Test Design, Validation, and Evaluation Guide (v1.0)
1. Test Scenario Selection Methodology
1.1 Category-Based Classification
The benchmark contains 62 test scenarios across 17 categories:
| Category | Test Count | Difficulty | Purpose |
|---|---|---|---|
| Secure | 4 | 2/10 | Negative control - should not produce false alarms |
| Information Disclosure | 4 | 3/10 | Basic header analysis (Server, X-Powered-By) |
| Misconfiguration | 5 | 3/10 | Missing CSP, HSTS, X-Frame-Options |
| FP Traps | 5 | 10/10 | Traps requiring semantic analysis |
| ... (17 categories in total, including CORS, modern attacks, and WAF bypass) | | | |
1.2 Scenario Design Principles
Each test scenario carries a ground-truth rule defined as a JSON-style object:
{
  category: "Category Name",   // Classification
  id: "unique_test_id",        // Identifier
  path: "/test/path",          // Mock server endpoint
  is_malicious: true,          // GROUND TRUTH (expected result): true or false
  mockResponse: {
    status: 200,               // HTTP status code
    headers: {...},            // Response headers
    body: "..."                // Response body
  }
}
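A scenario definition of this shape can be checked before it is loaded into the mock server. The sketch below is illustrative only: `validateScenario` and `exampleScenario` are hypothetical names that are not part of the benchmark, and the field values in the example are made up.

```javascript
// Hypothetical helper: returns a list of problems found in a scenario object.
// An empty list means the object has the fields described in section 1.2.
function validateScenario(scenario) {
  const errors = [];
  for (const key of ["category", "id", "path"]) {
    if (typeof scenario[key] !== "string" || scenario[key].length === 0) {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  if (typeof scenario.is_malicious !== "boolean") {
    errors.push("is_malicious must be a boolean");
  }
  const mock = scenario.mockResponse;
  if (typeof mock !== "object" || mock === null) {
    errors.push("mockResponse must be an object");
  } else {
    if (!Number.isInteger(mock.status)) {
      errors.push("mockResponse.status must be an integer");
    }
    if (typeof mock.headers !== "object" || mock.headers === null) {
      errors.push("mockResponse.headers must be an object");
    }
    if (typeof mock.body !== "string") {
      errors.push("mockResponse.body must be a string");
    }
  }
  return errors;
}

// Illustrative scenario; every value here is an assumption for the example.
const exampleScenario = {
  category: "Information Disclosure",
  id: "info_disclosure_example",
  path: "/test/info-disclosure",
  is_malicious: true,
  mockResponse: {
    status: 200,
    headers: { "Server": "ExampleServer/1.0", "X-Powered-By": "ExampleFramework" },
    body: "<html><body>ok</body></html>",
  },
};

console.log(validateScenario(exampleScenario));
```

Returning a list of messages instead of throwing on the first problem makes it easier to report every defect in a scenario file at once.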
1.3 Underlying Dataset Sources
- OWASP Top 10: Misconfiguration, CORS, Injection categories
- CWE/SANS 25: Information disclosure, security header patterns
- Real-world CVEs: Log4j, Apache path traversal
- Bug Bounty Reports: Modern attack vectors (SSTI, prototype pollution)
- Academic Research: Evasion tactics
2. Expected vs. Actual Logic
Expected: Determined by the is_malicious field.
- false → SECURE (Safe, should not trigger an alarm)
- true → VULNERABLE (Vulnerable, must be detected)
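The mapping above can be written as a small function. This is a minimal sketch; `expectedLabel` is a hypothetical name, and the strict boolean check is an assumption added so that a malformed scenario fails loudly instead of being silently treated as SECURE.

```javascript
// Hypothetical helper: maps the ground-truth flag to the expected verdict.
function expectedLabel(scenario) {
  if (typeof scenario.is_malicious !== "boolean") {
    throw new TypeError("is_malicious must be true or false");
  }
  return scenario.is_malicious ? "VULNERABLE" : "SECURE";
}

console.log(expectedLabel({ is_malicious: false }));
console.log(expectedLabel({ is_malicious: true }));
```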
2.1 Actionable Risk Filter
Only significant risks from the Engine response are considered; Low/Informational findings are filtered out as noise:
// Only Critical, High, and Medium levels are considered actionable
const actionableRisks = response.normalized_audit.findings
  .filter(f => {
    const sev = (f.severity || "").toLowerCase();
    return sev === "critical" || sev === "high" || sev === "medium";
  });
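One plausible way to turn the filtered findings into an actual verdict is sketched below. It assumes that a single actionable finding is enough to mark a response as VULNERABLE, which matches the "Risk Found" wording of the confusion matrix but is not stated explicitly in this guide. `actualLabel` is a hypothetical name, and the optional chaining is an added safeguard for responses that lack a findings list.

```javascript
// Severity levels treated as actionable, per section 2.1.
const ACTIONABLE_SEVERITIES = new Set(["critical", "high", "medium"]);

// Hypothetical helper: derives the actual verdict from an engine response.
// Assumption: one or more actionable findings means VULNERABLE.
function actualLabel(response) {
  const findings = response?.normalized_audit?.findings ?? [];
  const actionable = findings.filter((f) =>
    ACTIONABLE_SEVERITIES.has((f.severity || "").toLowerCase())
  );
  return actionable.length > 0 ? "VULNERABLE" : "SECURE";
}

const sampleResponse = {
  normalized_audit: {
    findings: [{ severity: "Low" }, { severity: "High" }],
  },
};
console.log(actualLabel(sampleResponse));
```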
3. False Positive Measurement Methodology
3.1 Confusion Matrix
| | Engine: Risk Found | Engine: No Risk Found |
|---|---|---|
| Expected: Vulnerable | ✅ True Positive | ❌ False Negative |
| Expected: Secure | ❌ False Positive | ✅ True Negative |
Example False Positive (FP) traps: a harmless HTML-entity-escaped payload (e.g., &lt;script&gt;) on an educational site, a vulnerability example inside a comment in a code repository, or the word "password" appearing in API documentation. Limma relies on semantic analysis to avoid flagging these traps.
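The confusion matrix can be tallied per test and summarized with standard ratios. The sketch below is a generic illustration, not the benchmark's own scoring code: `classifyOutcome`, `summarize`, and the `expectedMalicious` / `riskFound` field names are hypothetical. A ratio is reported as `null` when its denominator is zero.

```javascript
// Maps one test result to a confusion-matrix cell.
function classifyOutcome(expectedMalicious, riskFound) {
  if (expectedMalicious) return riskFound ? "TP" : "FN";
  return riskFound ? "FP" : "TN";
}

// Counts each cell and derives standard ratios from the counts.
function summarize(results) {
  const counts = { TP: 0, FN: 0, FP: 0, TN: 0 };
  for (const r of results) {
    counts[classifyOutcome(r.expectedMalicious, r.riskFound)] += 1;
  }
  const ratio = (n, d) => (d === 0 ? null : n / d);
  return {
    ...counts,
    // FP / (FP + TN): share of secure scenarios that raised an alarm
    falsePositiveRate: ratio(counts.FP, counts.FP + counts.TN),
    // TP / (TP + FP): share of alarms that were correct
    precision: ratio(counts.TP, counts.TP + counts.FP),
    // TP / (TP + FN): share of vulnerable scenarios that were detected
    recall: ratio(counts.TP, counts.TP + counts.FN),
  };
}

// Made-up results for illustration only; not benchmark data.
const sampleResults = [
  { expectedMalicious: true, riskFound: true },
  { expectedMalicious: true, riskFound: false },
  { expectedMalicious: false, riskFound: true },
  { expectedMalicious: false, riskFound: false },
  { expectedMalicious: false, riskFound: false },
];
console.log(summarize(sampleResults));
```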
4. Validation Methods (Runtime vs. Static)
The benchmark performs no static validation. Every test runs against a live mock server over a real HTTP request/response cycle, so Limma's entire network layer is exercised. A 2-second delay between tests keeps them isolated and avoids triggering rate-limiting mechanisms.
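A sequential runner with a fixed pause between tests might look like the sketch below. It is an assumption about structure, not the benchmark's actual runner: `runSequentially` and `sleep` are hypothetical names, and `auditScenario` stands in for whatever async function sends a scenario's path to the engine and returns its response.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs scenarios one at a time, pausing between consecutive tests.
// `auditScenario` is a caller-supplied async function (hypothetical here).
async function runSequentially(scenarios, auditScenario, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < scenarios.length; i++) {
    // No pause before the first test; delayMs between every later pair.
    if (i > 0) await sleep(delayMs);
    const response = await auditScenario(scenarios[i]);
    results.push({ id: scenarios[i].id, response });
  }
  return results;
}
```

Passing the engine call in as a parameter keeps the pacing logic separate from the HTTP client, so the delay can be shortened when the runner itself is being tested.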
Advantages of Runtime Validation
- Real HTTP stack usage: Connection, SSL/TLS, and timeout scenarios run through an actual network stack rather than a simulation.
- Header normalization testability: Obfuscated headers and other complex cases are validated dynamically.
- Encoding/charset handling: Payload decoding is tested over a real network connection.
5. Dataset Methodology and Difficulty
All test data is synthetic; no production data is used. This keeps the benchmark deterministic, reproducible on every run, and consistent with ethical guidelines. The ground-truth label of every scenario has been verified by human experts.
Difficulty Distribution Against Industry Norms
- Header-only detection
- Header + body parsing
- Multi-layer encoding
- Semantic FP traps