Transparency Report

Benchmark Methodology

Test Design, Validation, and Evaluation Guide (v1.0)

1. Test Scenario Selection Methodology

1.1 Category-Based Classification

The benchmark contains 62 test scenarios across 17 categories:

Category	Test Count	Difficulty	Purpose
Secure	4	2/10	Negative control - should not produce false alarms
Information Disclosure	4	3/10	Basic header analysis (Server, X-Powered-By)
Misconfiguration	5	3/10	Missing CSP, HSTS, X-Frame-Options
FP Traps	5	10/10	Traps requiring semantic analysis
... plus further categories, including CORS, modern attacks, and WAF bypass, for a total of 17.

1.2 Scenario Design Principles

Each test scenario carries a Ground Truth rule defined as a JSON-style object:

{
  category: "Category Name",      // Classification
  id: "unique_test_id",           // Identifier
  path: "/test/path",             // Mock server endpoint
  is_malicious: true/false,       // GROUND TRUTH (Expected result)
  mockResponse: {
    status: 200,                  // HTTP status code
    headers: {...},               // Response headers
    body: "..."                   // Response body
  }
}
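To make the schema above concrete, here is a minimal sketch of one filled-in scenario together with a structural check. The scenario id, path, header values, and the `validateScenario` helper are all hypothetical and are not taken from the benchmark itself; only the field names come from the schema.

```javascript
// Hypothetical example scenario; id, path, and header values are illustrative only.
const exampleScenario = {
  category: "Information Disclosure",
  id: "info_disclosure_server_header",
  path: "/test/info/server-header",
  is_malicious: true,
  mockResponse: {
    status: 200,
    headers: { "Server": "ExampleServer/1.0", "Content-Type": "text/html" },
    body: "<html><body>OK</body></html>"
  }
};

// Hypothetical helper: returns a list of problems; an empty list means
// the scenario has every field the schema describes, with a plausible type.
function validateScenario(s) {
  const problems = [];
  for (const key of ["category", "id", "path"]) {
    if (typeof s[key] !== "string" || s[key].length === 0) {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  if (typeof s.is_malicious !== "boolean") {
    problems.push("is_malicious must be a boolean");
  }
  const r = s.mockResponse;
  if (r === null || typeof r !== "object") {
    problems.push("mockResponse must be an object");
  } else {
    if (!Number.isInteger(r.status)) {
      problems.push("mockResponse.status must be an integer");
    }
    if (r.headers === null || typeof r.headers !== "object") {
      problems.push("mockResponse.headers must be an object");
    }
    if (typeof r.body !== "string") {
      problems.push("mockResponse.body must be a string");
    }
  }
  return problems;
}
```

Running a check like this over all scenarios before a benchmark run would catch a missing or mistyped is_malicious field, which would otherwise silently distort the expected results.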

1.3 Underlying Dataset Sources

OWASP Top 10: Misconfiguration, CORS, Injection categories
CWE/SANS 25: Information disclosure, security header patterns
Real-world CVEs: Log4j, Apache path traversal
Bug Bounty Reports: Modern attack vectors (SSTI, prototype pollution)
Academic Research: Evasion tactics

2. Expected vs. Actual Logic

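Expected: Determined by the is_malicious field.
- false → SECURE (should not trigger an alarm)
- true → VULNERABLE (must be detected)

Actual: Determined by the Engine response. VULNERABLE if at least one actionable risk is reported (Section 2.1); otherwise SECURE.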

2.1 Actionable Risk Filter

Only significant risks from the Engine response are considered; Low/Informational findings are filtered out as noise:

// Only Critical, High, and Medium levels are considered actionable
const actionableRisks = response.normalized_audit.findings
  .filter(f => {
    const sev = (f.severity || "").toLowerCase();
    return sev === "critical" || sev === "high" || sev === "medium";
  });
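The filter can be turned into a verdict that is directly comparable with the expected value. The sketch below assumes that a single actionable finding is enough to mark a scenario VULNERABLE; the function names `actualVerdict` and `expectedVerdict` are hypothetical, while the `normalized_audit.findings` path and the severity levels come from the snippet above.

```javascript
// Severity levels treated as actionable, as in the filter above.
const ACTIONABLE = new Set(["critical", "high", "medium"]);

// Hypothetical helper: derives the Actual verdict from an Engine response.
// Assumption: one actionable finding is sufficient for VULNERABLE.
function actualVerdict(response) {
  const findings = response?.normalized_audit?.findings ?? [];
  const actionable = findings.filter(f =>
    ACTIONABLE.has((f.severity || "").toLowerCase())
  );
  return actionable.length > 0 ? "VULNERABLE" : "SECURE";
}

// Hypothetical helper: derives the Expected verdict from the ground truth.
function expectedVerdict(scenario) {
  return scenario.is_malicious ? "VULNERABLE" : "SECURE";
}
```

A test passes when both functions return the same string. Note that a response containing only Low or Informational findings yields SECURE under this rule.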

3. False Positive Measurement Methodology

3.1 Confusion Matrix

	Engine: Risk Found	Engine: No Risk Found
Expected: Vulnerable	✅ True Positive	❌ False Negative
Expected: Secure	❌ False Positive	✅ True Negative
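The matrix above maps onto a small tallying routine. The following is a minimal sketch, not the benchmark's own scoring code: the `classifyOutcome` and `summarize` names and the `isMalicious`/`riskFound` result fields are assumptions, and the derived rates use the standard definitions (false positive rate = FP / (FP + TN), recall = TP / (TP + FN), precision = TP / (TP + FP)).

```javascript
// Maps one (expected, actual) pair onto a confusion-matrix cell.
function classifyOutcome(isMalicious, riskFound) {
  if (isMalicious) return riskFound ? "TP" : "FN";
  return riskFound ? "FP" : "TN";
}

// Tallies all results and derives the standard rates.
// A rate is null when its denominator is zero.
function summarize(results) {
  const counts = { TP: 0, FN: 0, FP: 0, TN: 0 };
  for (const r of results) {
    counts[classifyOutcome(r.isMalicious, r.riskFound)] += 1;
  }
  const ratio = (n, d) => (d === 0 ? null : n / d);
  return {
    ...counts,
    falsePositiveRate: ratio(counts.FP, counts.FP + counts.TN),
    recall: ratio(counts.TP, counts.TP + counts.FN),
    precision: ratio(counts.TP, counts.TP + counts.FP)
  };
}
```

The false positive rate is computed only over scenarios whose ground truth is Secure, so the negative-control and FP Trap scenarios are the ones that determine it.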

Example False Positive (FP) Traps: a harmless, HTML-entity-escaped payload (e.g., &lt;script&gt;) on an educational site; a vulnerability example inside a comment in a code repository; or the word "password" appearing in API documentation. Limma relies on semantic analysis to avoid flagging these traps.
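The escaped-payload trap can be illustrated with two deliberately simplistic checks. This sketch is not Limma's semantic analysis; it only shows why a scanner that decodes entities before pattern matching raises a false alarm on content that a browser would render as inert text. All names and sample bodies are hypothetical.

```javascript
// Decodes the three entities relevant to this example (illustrative only).
function decodeBasicEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// Naive scanner: decodes first, then pattern-matches. Prone to this FP trap.
function naiveDecodeThenMatch(body) {
  return /<script\b/i.test(decodeBasicEntities(body));
}

// Stricter check: flags only a literal <script ...> tag in the served body.
function containsLiteralScriptTag(body) {
  return /<script\b/i.test(body);
}

// Educational page showing an escaped payload as text (ground truth: secure).
const escapedBody = "<p>Example payload: &lt;script&gt;alert(1)&lt;/script&gt;</p>";
// Page that actually serves a script tag.
const rawBody = "<p>Hello</p><script>alert(1)</script>";
```

On `escapedBody` the naive scanner reports a match while the stricter check does not; on `rawBody` both report a match.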

4. Validation Methods (Runtime vs. Static)

The benchmark performs no static validation. Every test runs against a live mock server over a real HTTP request/response cycle, which exercises Limma's full Network layer. A 2-second delay between tests keeps them isolated and avoids triggering rate-limiting mechanisms.
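The sequential execution with a fixed pause can be sketched as follows. The `runSequentially` function and the `runOne` callback are hypothetical; in a real run, `runOne` would presumably send the HTTP request for one scenario to the mock server and return the Engine's result. The 2000 ms default reflects the delay stated above.

```javascript
// Resolves after the given number of milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Runs scenarios strictly one at a time, pausing between consecutive tests.
// runOne is a caller-supplied async function (an assumption of this sketch).
async function runSequentially(scenarios, runOne, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < scenarios.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first test
    results.push(await runOne(scenarios[i]));
  }
  return results;
}
```

Because each test is awaited before the next one starts, no two requests overlap. With 62 scenarios and a 2-second pause, the delays alone add about 122 seconds (61 pauses) to a full run.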

Advantages of Runtime Validation

Real HTTP stack usage: Connection, SSL/TLS, and timeout scenarios are exercised over an actual HTTP stack rather than simulated.
Header Normalization testability: Obfuscated headers and other complex cases are validated dynamically.
Encoding/Charset Handling: Payload decoding is tested across a real network round trip.
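As an illustration of what header normalization typically involves, the sketch below lowercases header names (HTTP field names are case-insensitive), trims surrounding whitespace, and joins repeated fields with a comma. This is a generic approach and an assumption of this example, not a description of Limma's internal normalizer; `normalizeHeaders` is a hypothetical name. Fields such as Set-Cookie, which should not be comma-joined, are ignored here for brevity.

```javascript
// Normalizes a list of [name, value] pairs into a Map keyed by lowercase name.
// Repeated fields are combined into one comma-separated value.
function normalizeHeaders(rawPairs) {
  const normalized = new Map();
  for (const [name, value] of rawPairs) {
    const key = String(name).trim().toLowerCase();
    const val = String(value).trim();
    normalized.set(
      key,
      normalized.has(key) ? `${normalized.get(key)}, ${val}` : val
    );
  }
  return normalized;
}

// Usage: oddly cased and padded variants collapse onto one key.
const headers = normalizeHeaders([
  ["X-POWERED-by", "  PHP  "],
  ["x-powered-by", "Express"],
  ["Server", "ExampleServer"]
]);
```

After normalization, a lookup such as `headers.get("x-powered-by")` finds the value regardless of how the mock server cased the field name.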

5. Dataset Methodology and Difficulty

All test data is purely synthetic; no data from live production environments is used. This keeps the benchmark deterministic and reproducible from run to run, and within ethical guidelines. The dataset includes Ground Truth vulnerability scenarios verified by human experts.

Difficulty Distribution Against Industry Norms

Difficulty	Level	Limma Target	Detection Scope
2-4 / 10	Simple	100%	Header-only detection
5-6 / 10	Medium	90%+	Header + Body parsing
7-8 / 10	Advanced	75%+	Multi-layer encoding
9-10 / 10	Expert	40%+	Semantic FP Traps
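The per-tier targets can be checked mechanically. The sketch below assumes that each target is a minimum pass rate (the share of scenarios in the tier where Actual matches Expected) and that each result carries a numeric `difficulty` and a boolean `passed`; these field names and the `tierReport` function are hypothetical.

```javascript
// Difficulty tiers and minimum pass rates, taken from the figures above.
const TIERS = [
  { name: "Simple", min: 2, max: 4, target: 1.0 },
  { name: "Medium", min: 5, max: 6, target: 0.9 },
  { name: "Advanced", min: 7, max: 8, target: 0.75 },
  { name: "Expert", min: 9, max: 10, target: 0.4 }
];

// Groups results by tier and compares each pass rate with its target.
// A tier with no results has rate null and does not meet its target.
function tierReport(results) {
  return TIERS.map(tier => {
    const inTier = results.filter(
      r => r.difficulty >= tier.min && r.difficulty <= tier.max
    );
    const passed = inTier.filter(r => r.passed).length;
    const rate = inTier.length === 0 ? null : passed / inTier.length;
    return {
      name: tier.name,
      total: inTier.length,
      passed,
      rate,
      meetsTarget: rate !== null && rate >= tier.target
    };
  });
}
```

Reporting the rate per tier rather than as one overall figure keeps a strong result on Simple scenarios from masking a weak result on Expert ones.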

Methodology Version: 1.0 | Total Scenarios: 62 Tests / 17 Categories