Transparency Report

Benchmark Methodology

Test Design, Validation, and Evaluation Guide (v1.0)

1. Test Scenario Selection Methodology

1.1 Category-Based Classification

The benchmark contains 62 test scenarios across 17 categories:

Category                 Test Count   Difficulty   Purpose
Secure                   4            2/10         Negative control - should not produce false alarms
Information Disclosure   4            3/10         Basic header analysis (Server, X-Powered-By)
Misconfiguration         5            3/10         Missing CSP, HSTS, X-Frame-Options
FP Traps                 5            10/10        Traps requiring semantic analysis

The remaining categories, 17 in total, include CORS, modern attacks, and WAF bypass.

1.2 Scenario Design Principles

Each test scenario carries a Ground Truth rule defined in a JSON-style format:

{
  category: "Category Name",    // Classification
  id: "unique_test_id",         // Identifier
  path: "/test/path",           // Mock server endpoint
  is_malicious: true,           // GROUND TRUTH (expected result): true or false
  mockResponse: {
    status: 200,                // HTTP status code
    headers: {...},             // Response headers
    body: "..."                 // Response body
  }
}
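
A scenario definition with this shape can be checked before a run. The following is a minimal sketch, not the benchmark's own validator: the function name `validateScenario`, the specific rules (for example, that `path` starts with "/"), and the example values are assumptions made for illustration.

```javascript
// Hypothetical helper: checks that a scenario has the fields listed above.
// Returns an array of problems; an empty array means the shape looks valid.
function validateScenario(scenario) {
  if (scenario === null || typeof scenario !== "object") {
    return ["scenario must be an object"];
  }
  const problems = [];
  for (const field of ["category", "id", "path"]) {
    if (typeof scenario[field] !== "string" || scenario[field].length === 0) {
      problems.push(`${field} must be a non-empty string`);
    }
  }
  // Assumed rule: mock server endpoints are absolute paths.
  if (typeof scenario.path === "string" && !scenario.path.startsWith("/")) {
    problems.push("path must start with '/'");
  }
  if (typeof scenario.is_malicious !== "boolean") {
    problems.push("is_malicious must be a boolean");
  }
  const mock = scenario.mockResponse;
  if (mock === null || typeof mock !== "object") {
    problems.push("mockResponse must be an object");
  } else {
    if (!Number.isInteger(mock.status) || mock.status < 100 || mock.status > 599) {
      problems.push("mockResponse.status must be an HTTP status code (100-599)");
    }
    if (mock.headers === null || typeof mock.headers !== "object") {
      problems.push("mockResponse.headers must be an object");
    }
    if (typeof mock.body !== "string") {
      problems.push("mockResponse.body must be a string");
    }
  }
  return problems;
}

// Illustrative values only; not taken from the benchmark dataset.
const example = {
  category: "Information Disclosure",
  id: "example_server_header",
  path: "/test/example",
  is_malicious: true,
  mockResponse: {
    status: 200,
    headers: { Server: "ExampleServer/1.0" },
    body: "<html></html>",
  },
};
console.log(validateScenario(example)); // prints []
```

Returning a list of problems rather than throwing on the first one makes it easier to report every defect in a dataset in a single pass.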

1.3 Dataset Underlying Sources

  • OWASP Top 10: Misconfiguration, CORS, Injection categories
  • CWE/SANS 25: Information disclosure, security header patterns
  • Real-world CVEs: Log4j, Apache path traversal
  • Bug Bounty Reports: Modern attack vectors (SSTI, prototype pollution)
  • Academic Research: Evasion tactics

2. Expected vs. Actual Logic

Expected: Determined by the is_malicious field.
- false → SECURE (Safe, should not trigger an alarm)
- true → VULNERABLE (Vulnerable, must be detected)

2.1 Actionable Risk Filter

Only significant risks from the Engine response are considered; Low/Informational findings are filtered out as noise:

// Only Critical, High, and Medium levels are considered actionable
const actionableRisks = response.normalized_audit.findings
  .filter(f => {
    const sev = (f.severity || "").toLowerCase();
    return sev === "critical" || sev === "high" || sev === "medium";
  });
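
The filter above can be turned into an expected-versus-actual comparison. The sketch below assumes that a single actionable finding is enough to count as "risk found"; that threshold, the function names, and the `title` field in the sample response are assumptions for illustration rather than confirmed details of the engine.

```javascript
const ACTIONABLE_SEVERITIES = new Set(["critical", "high", "medium"]);

// Same filter as above, wrapped so that a missing findings array is treated as empty.
function getActionableRisks(response) {
  const findings = response?.normalized_audit?.findings ?? [];
  return findings.filter((f) =>
    ACTIONABLE_SEVERITIES.has((f.severity || "").toLowerCase())
  );
}

// ASSUMPTION: one actionable finding is enough to count as "risk found".
function actualVerdict(response) {
  return getActionableRisks(response).length > 0 ? "VULNERABLE" : "SECURE";
}

// Expected verdict comes straight from the scenario's is_malicious field.
function expectedVerdict(scenario) {
  return scenario.is_malicious ? "VULNERABLE" : "SECURE";
}

// Illustrative engine response; the "title" field is hypothetical.
const sampleResponse = {
  normalized_audit: {
    findings: [
      { title: "Server header exposes version", severity: "Low" },
      { title: "Missing Content-Security-Policy", severity: "MEDIUM" },
    ],
  },
};
console.log(actualVerdict(sampleResponse)); // prints VULNERABLE
```

The Low finding is dropped as noise, while the Medium finding survives the case-insensitive comparison and drives the verdict.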

3. False Positive Measurement Methodology

3.1 Confusion Matrix

                       Engine: Risk Found    Engine: No Risk Found
Expected: Vulnerable   ✅ True Positive      ❌ False Negative
Expected: Secure       ❌ False Positive     ✅ True Negative
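
The matrix above maps directly onto a small tally. The sketch below uses the standard definitions of false positive rate, FP / (FP + TN), detection rate, TP / (TP + FN), and precision, TP / (TP + FP). The function names and the result shape (`isMalicious`, `riskFound`) are assumptions, and the source does not state which of these metrics the benchmark reports.

```javascript
// Maps one test result onto a confusion matrix cell.
function classifyOutcome(isMalicious, riskFound) {
  if (isMalicious) return riskFound ? "TP" : "FN";
  return riskFound ? "FP" : "TN";
}

// Tallies all cells and derives the standard rates.
// A rate is null when its denominator is zero.
function summarize(results) {
  const counts = { TP: 0, FN: 0, FP: 0, TN: 0 };
  for (const r of results) {
    counts[classifyOutcome(r.isMalicious, r.riskFound)] += 1;
  }
  const ratio = (num, den) => (den === 0 ? null : num / den);
  return {
    ...counts,
    falsePositiveRate: ratio(counts.FP, counts.FP + counts.TN),
    detectionRate: ratio(counts.TP, counts.TP + counts.FN),
    precision: ratio(counts.TP, counts.TP + counts.FP),
  };
}

// Illustrative results, one per line.
const summary = summarize([
  { isMalicious: true, riskFound: true },   // TP
  { isMalicious: true, riskFound: false },  // FN
  { isMalicious: false, riskFound: true },  // FP
  { isMalicious: false, riskFound: false }, // TN
  { isMalicious: false, riskFound: false }, // TN
]);
console.log(summary.falsePositiveRate); // FP / (FP + TN) = 1 / 3
```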

Example False Positive (FP) traps: a harmless, HTML-entity-escaped payload (e.g., &lt;script&gt;) on an educational site, a vulnerability example inside a comment in a code repository, or the word "password" appearing in API documentation. Limma's semantic analysis is designed to recognize such content as benign rather than flag it.

4. Validation Methods (Runtime vs. Static)

There is no static validation in the benchmark. All tests run against a live mock server over a real HTTP request/response cycle, which exercises Limma's entire network layer. A 2-second delay is inserted between tests to keep them isolated and to avoid triggering rate-limiting mechanisms.
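
The sequential execution with a fixed pause can be sketched as follows. This is an illustration under assumptions, not the benchmark's runner: `runSequentially` and `sendRequest` are hypothetical names, and `sendRequest` stands in for whatever performs the real HTTP call and returns the engine's result. The delay defaults to the 2 seconds stated above; the demo uses a much shorter value.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs scenarios strictly one at a time, pausing between consecutive tests.
// `sendRequest` is a hypothetical async function supplied by the caller.
async function runSequentially(scenarios, sendRequest, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < scenarios.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause is needed before the first test
    const id = scenarios[i].id;
    try {
      results.push({ id, ok: true, value: await sendRequest(scenarios[i]) });
    } catch (err) {
      // Record the failure and continue so one bad test does not abort the run.
      results.push({ id, ok: false, error: String(err) });
    }
  }
  return results;
}

// Demo with a stub sender and a 10 ms delay.
runSequentially([{ id: "a" }, { id: "b" }], async (s) => s.id.toUpperCase(), 10)
  .then((results) => console.log(results.map((r) => r.value))); // prints [ 'A', 'B' ]
```

Passing the sender in as a parameter keeps the timing logic testable without a network connection.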

Advantages of Runtime Validation

  • Real HTTP stack usage: Connection, SSL/TLS, and timeout scenarios behave as they would against a real server.
  • Header normalization testability: Obfuscated headers and other complex cases are validated dynamically.
  • Encoding/charset handling: Payload decoding is tested across a real network connection.
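
To make the real-HTTP-stack point concrete, a mock endpoint that replays a scenario's mockResponse can be built on Node's built-in `http` module. This is a sketch under assumptions: `buildMockHandler` and the example scenario are hypothetical, and the source does not describe how its mock server is implemented.

```javascript
const http = require("http");

// Builds a request handler that replays each scenario's mockResponse at its path.
function buildMockHandler(scenarios) {
  const byPath = new Map(scenarios.map((s) => [s.path, s.mockResponse]));
  return (req, res) => {
    // Strip any query string so lookups use the path only.
    const path = new URL(req.url, "http://localhost").pathname;
    const mock = byPath.get(path);
    if (!mock) {
      res.writeHead(404, { "Content-Type": "text/plain" });
      res.end("unknown test path");
      return;
    }
    res.writeHead(mock.status, mock.headers);
    res.end(mock.body);
  };
}

// Illustrative scenario; values are not taken from the benchmark dataset.
const exampleScenario = {
  id: "example_server_header",
  path: "/test/example",
  is_malicious: true,
  mockResponse: {
    status: 200,
    headers: { Server: "ExampleServer/1.0", "Content-Type": "text/html" },
    body: "<html></html>",
  },
};

const handler = buildMockHandler([exampleScenario]);
const server = http.createServer(handler);
// Uncomment to serve on an OS-assigned free port:
// server.listen(0, () => console.log("listening on port", server.address().port));
```

Because the handler only calls `writeHead` and `end`, it can also be exercised with stub request and response objects before it is attached to a listening server.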

5. Dataset Methodology and Difficulty

All test data is purely synthetic; no data from live production environments is used. This keeps the benchmark deterministic, within ethical guidelines, and reproducible from run to run. The dataset's Ground Truth vulnerability scenarios have been verified by human experts.

Difficulty Distribution Against Industry Norms

Difficulty   Level      Limma Target   Scope
2-4 / 10     Simple     100%           Header-only detection
5-6 / 10     Medium     90%+           Header + body parsing
7-8 / 10     Advanced   75%+           Multi-layer encoding
9-10 / 10    Expert     40%+           Semantic FP traps

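
The per-tier targets can be checked mechanically once results are grouped by difficulty. The sketch below assumes that a target such as "90%+" means at least 90% of the scenarios in that tier are judged correctly; that reading, the result shape (`difficulty`, `correct`), and the function names are assumptions for illustration.

```javascript
// Difficulty tiers and targets as listed above (targets expressed as fractions).
const TIERS = [
  { name: "Simple",   min: 2, max: 4,  target: 1.0 },
  { name: "Medium",   min: 5, max: 6,  target: 0.9 },
  { name: "Advanced", min: 7, max: 8,  target: 0.75 },
  { name: "Expert",   min: 9, max: 10, target: 0.4 },
];

// Returns the tier for a difficulty score, or null if no tier covers it.
function tierFor(difficulty) {
  return TIERS.find((t) => difficulty >= t.min && difficulty <= t.max) ?? null;
}

// results: [{ difficulty: number, correct: boolean }]
// ASSUMPTION: a tier is met when its fraction of correct results reaches the target.
function checkTargets(results) {
  return TIERS.map((tier) => {
    const inTier = results.filter((r) => tierFor(r.difficulty) === tier);
    const passed = inTier.filter((r) => r.correct).length;
    const rate = inTier.length === 0 ? null : passed / inTier.length;
    return {
      tier: tier.name,
      total: inTier.length,
      rate,
      met: rate === null ? null : rate >= tier.target,
    };
  });
}

// Illustrative results only.
const report = checkTargets([
  { difficulty: 3, correct: true },
  { difficulty: 3, correct: true },
  { difficulty: 10, correct: true },
  { difficulty: 10, correct: false },
  { difficulty: 9, correct: false },
]);
console.log(report);
```

In this illustrative run the Simple tier meets its 100% target, while the Expert tier scores 1 of 3 and falls short of 40%; tiers with no results report `null` rather than a pass or fail.
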
Methodology Version: 1.0 | Total Scenarios: 62 Tests / 17 Categories