
Local Corpus Build Scenario

This document logs the local exploratory workflow used to discover, download, normalize, sort, and clean public detection-rule sources. It is not the locked research methodology. The formal sampling and validation method still belongs in PROTOCOL.md, especially Sections 4 and 5.

The purpose of this scenario is to preserve operational knowledge while the protocol is still being written: what was attempted, why each step exists, what was kept for Phase 1, and what was removed as future scope or duplicate material.

Scenario Goal

Build a local working corpus from public detection-rule repositories so the project can later evaluate rule robustness under a controlled protocol.

The desired long-term pipeline is:

  1. Find public open-source detection-rule sources.
  2. Download source snapshots locally.
  3. Normalize or translate rules into analysis-friendly formats.
  4. Sort rules by source, format, MITRE metadata, and validation feasibility.
  5. Remove non-useful sources, duplicate artifacts, generated caches, and out-of-scope formats.
  6. Keep a small Phase 1 working set aligned with the current protocol direction.

Current Phase 1 Scope Used for Cleanup

The cleanup used the current working scope discussed during R4 planning:

| Category | Status | Validation direction |
|---|---|---|
| YARA file/content rules | Kept | Native yara / yara-python validation |
| Elastic detection rules | Kept | Native Elastic KQL/EQL validation |
| Sigma rules | Kept | Translate to Elastic/EQL where translation fidelity is high |
| YARA-L conversions | Exploratory only | Removed from retained output until reproducibility and backend execution are decided |
| Splunk, Sentinel, Chronicle, Falco, Wazuh, Snort, Suricata, Panther, OpenSearch | Future/exploratory | Removed from retained local working set |

This scope is pragmatic, not final. PROTOCOL.md Sections 4.1, 4.2, 4.3, and 5.2 must still define the formal inclusion rules.

Step 1 - Find Public Rule Sources

Purpose: identify public rule repositories that could plausibly support a detection robustness benchmark.

Inputs (candidate sources reviewed):

  • SigmaHQ Sigma rules.
  • Elastic Detection Rules.
  • Splunk Security Content.
  • Public YARA repositories.
  • Elastic Protections Artifacts.
  • Microsoft Sentinel.
  • Chronicle / Google SecOps detection rules.
  • Google Security Analytics.
  • Panther Analysis.
  • Falco rules.
  • Wazuh rules.
  • OpenSearch Security Analytics.
  • Emerging Threats Suricata rules.
  • Snort community rules.
  • Discovery indexes such as Awesome YARA.

Why this step matters:

  • The benchmark cannot be credible if the source universe is vague.
  • Every source has different licensing, rule semantics, metadata quality, and evaluator requirements.
  • Source discovery must happen before sampling design so R5.2 can define a defensible sampling frame.

Output:

  • A documented source inventory with source URLs, download methods, local target paths, and R5.2 notes.
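
A minimal sketch of what one inventory entry could look like, assuming the inventory is stored as one JSON object per line. The class name, field names, and method labels are hypothetical; the fields simply mirror the items listed in the output above, and the note text is a placeholder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One row of the source inventory (field names are illustrative)."""
    name: str             # local directory name of the source
    url: str              # upstream source URL
    download_method: str  # assumed labels: "git-shallow", "git-sparse", "archive"
    local_path: str       # local target path of the snapshot
    r52_notes: str = ""   # free-text notes for the R5.2 sampling frame


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


example = SourceRecord(
    name="sigmahq-sigma",
    url="https://github.com/SigmaHQ/sigma",
    download_method="archive",
    local_path="detection rules/sigmahq-sigma",
    r52_notes="placeholder note",
)
print(to_jsonl([example]))
```

Keeping the record flat makes it easy to diff the inventory between snapshots and to join it against later per-rule outputs.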

Step 2 - Download Local Source Snapshots

Purpose: create local snapshots for inspection and dataset preparation without relying on changing upstream repositories.

Initial local target:

/home/andrey/git-projects/brittle/detection rules

Download methods used:

  • git clone --depth 1 for ordinary GitHub repositories.
  • sparse Git checkout for very large repositories such as Azure Sentinel and Wazuh main.
  • curl -L for direct archive endpoints such as Sigma tarball, Emerging Threats, and Snort community rules.
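
The three download methods above can be sketched as command builders. This is an illustration, not the exact commands that were run: the function names are hypothetical, the URLs in the usage example are placeholders, and the sparse variant assumes a Git version that supports `git clone --sparse` and `git sparse-checkout set`.

```python
import shlex


def shallow_clone_cmd(url, dest):
    """Shallow clone for ordinary repositories."""
    return ["git", "clone", "--depth", "1", url, dest]


def sparse_clone_cmds(url, dest, paths):
    """Sparse checkout for very large repositories: clone without blobs,
    then restrict the working tree to the rule-bearing paths."""
    return [
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse", url, dest],
        ["git", "-C", dest, "sparse-checkout", "set", *paths],
    ]


def archive_cmd(url, out_file):
    """Direct archive download, following redirects."""
    return ["curl", "-L", "-o", out_file, url]


# Placeholder URL and file name; the target directory contains a space.
cmd = archive_cmd("https://example.org/x.tar.gz", "detection rules/x.tar.gz")
print(shlex.join(cmd))
```

Building argument lists rather than shell strings means a target path with a space, such as `detection rules`, needs no manual quoting when the list is passed to `subprocess.run`.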

Why this step matters:

  • Rule repositories change frequently.
  • Local snapshots allow repeatable parsing and cleanup.
  • Shallow/sparse retrieval reduces disk usage when only rule-bearing paths are needed.

Important result:

  • The initial broad download reached approximately 5.6G.
  • After cleanup, retained local raw sources were reduced to approximately 74M.

Step 3 - Convert or Normalize Rules

Purpose: move from heterogeneous upstream repositories toward formats that can be analyzed consistently.

Exploratory conversion outputs created by the local corpus builder included:

| Output | Purpose | Final cleanup decision |
|---|---|---|
| 01_inventory.jsonl | Unified inventory of discovered rules and metadata. | Kept, filtered to Phase 1 sources. |
| 04_translated_sigma.jsonl | Sigma normalization/translation output. | Kept, filtered to useful Sigma paths. |
| 05_translated_elastic.jsonl | Elastic rule normalization output. | Kept. |
| yara_file_rules/by_source/*.yar | Consolidated native YARA bundles by source repository. | Kept. |
| YARA-L output | Experimental single-format event-rule conversion. | Removed from retained output for now. |
| Splunk/Sentinel/Chronicle/other translated outputs | Future-scope translation experiments. | Removed from retained output. |
| Network translated output | Future-scope Snort/Suricata-style material. | Removed from retained output. |

Why not keep YARA-L as the single retained format right now:

  • YARA-L is an event/log detection language, not a native file-signature evaluator.
  • Classic YARA rules should remain native for correctness.
  • Snort/Suricata packet rules, Falco runtime rules, Wazuh XML rules, and Splunk SPL rules lose semantics if forced into YARA-L without a formal translation-fidelity model.
  • YARA-L execution reproducibility still needs a clear local validation strategy or controlled Chronicle/Google SecOps access.

Current interpretation:

  • YARA-L conversion is useful as an exploratory normalization experiment.
  • It should not be treated as the confirmatory validation backend until R4/R5 define translation fidelity, evaluator availability, and exclusion rules.

Step 4 - Sort and Classify

Purpose: separate rules that are useful for the next protocol steps from rules that belong to future expansion.

Classification used for cleanup:

| Source | Kept? | Reason |
|---|---|---|
| sigmahq-sigma | Yes | Core Phase 1 event/log rule source; candidate for Elastic/EQL translation. |
| elastic-detection-rules | Yes | Core Phase 1 native Elastic source. |
| yara-rules | Yes | Core native YARA source. |
| neo23x0-signature-base | Yes | Core native YARA source after pruning IOC/scripts. |
| elastic-protections-artifacts | Yes | Kept for YARA rules; non-YARA behavior/ransomware artifacts pruned. |
| reversinglabs-yara-rules | Yes | Core native YARA source. |
| inquest-yara-rules | Yes | Core native YARA source after removing non-rule artifacts. |
| stratosphere-yara-rules | Yes | Small native YARA source. |
| malpedia-signator-rules | Yes | Native YARA source; should be flagged later as generated rules. |
| azure-sentinel | No | Future KQL scope; very large; not Phase 1. |
| splunk-security-content | No | Future SPL scope; needs native SPL or fidelity model. |
| chronicle-detection-rules | No | Future YARA-L scope; evaluator/reproducibility unresolved. |
| google-security-analytics | No | Future cloud analytics scope. |
| panther-analysis | No | Future Python/YAML detection-as-code scope. |
| falco-rules | No | Future runtime/container scope; native Falco preferred. |
| wazuh-main, wazuh-ruleset | No | Future Wazuh/OSSEC XML scope; native evaluator preferred. |
| opensearch-security-analytics | No | Future or duplicate Sigma-derived source; double-counting risk. |
| Snort/Suricata archives | No | Future network IDS scope; native packet/flow evaluator preferred. |
| awesome-yara-index | No | Discovery index, not a rule corpus. |

Why this step matters:

  • Keeping every downloaded source makes R4/R5 harder and increases accidental scope creep.
  • The first benchmark should be small enough to define precisely.
  • Out-of-scope sources can return later as exploratory or Phase 2 additions.

Step 5 - Clean Local Raw Sources

Purpose: keep only files useful for Phase 1 inspection and avoid carrying repository noise.

Actions taken:

  • Removed full future/exploratory source repositories.
  • Removed duplicate archive downloads after source directories existed.
  • Removed .git and .github metadata from retained local snapshots.
  • Pruned Sigma to rules/ and rules-emerging-threats/ plus basic license/readme files.
  • Pruned Elastic Detection Rules to rules/ plus basic license/readme/notice files.
  • Pruned Elastic Protections Artifacts to yara/ plus basic license/readme files.
  • Pruned Neo23x0 Signature Base to yara/, vendor/, license, and README.
  • Removed obvious non-rule artifacts such as PDFs, generator scripts, CI metadata, and duplicate/stub YARA files.

Why this step matters:

  • Reduces disk usage.
  • Reduces parser noise.
  • Makes later manual inspection easier.
  • Prevents future scripts from accidentally including docs, tests, caches, or unrelated formats.
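
The pruning actions above could be expressed as a dry-run plan before anything is deleted. This is a sketch under assumptions: the function and constant names are hypothetical, the kept subdirectories are taken from the actions listed above, and keeping any top-level file whose name starts with license/readme/notice is a simplification of the per-source decisions.

```python
import tempfile
from pathlib import Path

# Subdirectories kept per source, as listed in the cleanup actions.
KEEP_SUBDIRS = {
    "sigmahq-sigma": {"rules", "rules-emerging-threats"},
    "elastic-detection-rules": {"rules"},
    "elastic-protections-artifacts": {"yara"},
    "neo23x0-signature-base": {"yara", "vendor"},
}
# Simplifying assumption: keep these top-level files for every source.
KEEP_FILE_PREFIXES = ("license", "readme", "notice")


def plan_prune(source_dir: Path) -> list[Path]:
    """Return the top-level entries that would be removed (nothing is deleted)."""
    keep = KEEP_SUBDIRS.get(source_dir.name)
    if keep is None:
        return []  # no pruning policy defined for this source
    doomed = []
    for entry in sorted(source_dir.iterdir()):
        if entry.is_dir() and entry.name in keep:
            continue
        if entry.is_file() and entry.name.lower().startswith(KEEP_FILE_PREFIXES):
            continue
        doomed.append(entry)
    return doomed


# Demo on a throwaway directory tree that imitates a cloned source.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sigmahq-sigma"
    for sub in (".git", "rules", "rules-emerging-threats", "tests"):
        (source / sub).mkdir(parents=True)
    (source / "LICENSE").write_text("placeholder")
    (source / "README.md").write_text("placeholder")
    planned = [p.name for p in plan_prune(source)]

print(planned)
```

Reviewing the planned removals first keeps the cleanup auditable; the actual deletion can then be a separate, explicit step.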

Result:

  • Raw source folder: approximately 74M.
  • Remaining top-level raw sources: 9.
  • Remaining raw rule-like files: 9,372.
  • Exact duplicate raw rule-like files found after cleanup: 0.
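
One way to check for exact duplicates is to group files by a content hash. The sketch below is not the script that produced the counts above: the function name is hypothetical, and the suffix set standing in for "rule-like files" is an assumption.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed definition of "rule-like" for this sketch only.
RULE_SUFFIXES = {".yar", ".yara", ".yml", ".yaml", ".toml"}


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Group rule-like files under root by SHA-256 of their bytes and
    return only the groups that contain more than one file."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in RULE_SUFFIXES:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


# Demo: two byte-identical rules, one different rule, one non-rule file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.yar").write_text("rule demo { condition: true }\n")
    (root / "b.yar").write_text("rule demo { condition: true }\n")
    (root / "c.yar").write_text("rule other { condition: false }\n")
    (root / "notes.pdf").write_bytes(b"not a rule")
    groups = [[p.name for p in g] for g in find_exact_duplicates(root)]

print(groups)
```

This only detects byte-identical files. Rules that differ in whitespace, comments, or metadata would not be grouped.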

Step 6 - Clean Generated Output

Purpose: retain only generated outputs that match the current Phase 1 direction.

Actions taken:

  • Removed duplicate packaged output directory brittle-corpus-v0.1.0.
  • Removed YARA-L generated output from retained local output.
  • Removed network, Splunk, Sentinel, Chronicle, and miscellaneous translated JSONL outputs.
  • Removed cache files and empty validation findings.
  • Filtered 01_inventory.jsonl to retained Phase 1 source repositories.
  • Filtered 04_translated_sigma.jsonl to useful Sigma rule paths.
  • Kept 05_translated_elastic.jsonl as the Elastic normalized output.
  • Kept yara_file_rules/by_source/*.yar as consolidated native YARA bundles.
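
Filtering a JSONL inventory to the retained sources can be sketched as follows. The record layout is an assumption: the field name `source` and the sample records are hypothetical, and only the set of retained source names comes from this document.

```python
import json

RETAINED_SOURCES = {
    "elastic-detection-rules",
    "elastic-protections-artifacts",
    "inquest-yara-rules",
    "malpedia-signator-rules",
    "neo23x0-signature-base",
    "reversinglabs-yara-rules",
    "sigmahq-sigma",
    "stratosphere-yara-rules",
    "yara-rules",
}


def filter_inventory(lines, source_field="source"):
    """Yield parsed records whose source is in the retained set.
    `source_field` is an assumed key name, not a documented schema."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get(source_field) in RETAINED_SOURCES:
            yield record


# Hypothetical sample records.
sample = [
    '{"source": "sigmahq-sigma", "path": "rules/example.yml"}',
    '{"source": "falco-rules", "path": "rules/example.yaml"}',
    "",
]
kept = list(filter_inventory(sample))
print(kept)
```

Writing the filtered records to a new file, rather than editing in place, would keep the unfiltered inventory available for comparison.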

Current retained generated output:

corpus-build/output/01_inventory.jsonl
corpus-build/output/04_translated_sigma.jsonl
corpus-build/output/05_translated_elastic.jsonl
corpus-build/output/yara_file_rules/_summary.json
corpus-build/output/yara_file_rules/by_source/*.yar

Result:

  • Generated output folder: approximately 67M.

Step 7 - Protect Local Artifacts from Git

Purpose: prevent local corpus artifacts from being committed before the protocol is locked.

Actions taken:

  • Added detection rules/ to .gitignore.
  • Added corpus-build/ to .gitignore.
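
A small sketch of adding these entries idempotently, so repeated runs do not duplicate lines. The function name is hypothetical and the existing `.gitignore` content in the example is a placeholder.

```python
def ensure_ignored(gitignore_text, patterns):
    """Return gitignore text with each pattern present exactly once,
    appending only the patterns that are missing."""
    lines = gitignore_text.splitlines()
    present = {line.strip() for line in lines}
    for pattern in patterns:
        if pattern not in present:
            lines.append(pattern)
            present.add(pattern)
    return "\n".join(lines) + "\n"


PATTERNS = ["detection rules/", "corpus-build/"]

updated = ensure_ignored("node_modules/\n", PATTERNS)
print(updated, end="")
```

A space inside a `.gitignore` pattern, as in `detection rules/`, is matched literally; only trailing spaces need escaping.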

Why this step matters:

  • The protocol currently says execution-phase artifacts should not enter the repository before lock.
  • Raw rule sources and generated corpora may have licensing and dual-use constraints.
  • Local artifacts can be rebuilt later from documented source URLs and snapshot metadata.

Current Local State

Raw source directory:

/home/andrey/git-projects/brittle/detection rules

Retained raw sources:

elastic-detection-rules
elastic-protections-artifacts
inquest-yara-rules
malpedia-signator-rules
neo23x0-signature-base
reversinglabs-yara-rules
sigmahq-sigma
stratosphere-yara-rules
yara-rules

Generated output directory:

/home/andrey/git-projects/brittle/corpus-build/output

Retained generated outputs:

01_inventory.jsonl
04_translated_sigma.jsonl
05_translated_elastic.jsonl
yara_file_rules/

Decisions Still Needed Before Formalizing

  • Whether confirmatory analysis of Sigma rules is based on Elastic/EQL translation only, or whether untranslated Sigma rules are retained separately.
  • What translation-fidelity threshold excludes a Sigma rule from confirmatory analysis.
  • Whether generated YARA sources such as Malpedia Signator Rules are analyzed separately from human-authored YARA sources.
  • Whether Elastic Endgame/promotion rules are in scope or excluded as vendor-alert wrapper rules.
  • Whether rules-emerging-threats/ in Sigma is confirmatory or exploratory.
  • Whether YARA-L becomes a later secondary backend after local reproducibility is solved.
  • How duplicates are defined: byte-identical files, identical rule names, identical logic, or canonicalized rule AST equivalence.
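
To illustrate why the duplicate definition matters, the sketch below computes three candidate keys for YARA text: byte identity, rule names, and a whitespace-normalized hash. All function names are hypothetical. The normalized key is a crude text heuristic that also collapses whitespace inside string literals; it is not a stand-in for logic or AST equivalence.

```python
import hashlib
import re

# Matches "rule <name>" with optional "private"/"global" modifiers.
_RULE_NAME = re.compile(
    r"^\s*(?:(?:private|global)\s+)*rule\s+([A-Za-z_][A-Za-z0-9_]*)",
    re.MULTILINE,
)


def byte_key(text):
    """Identity of the exact bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def name_key(text):
    """Sorted tuple of rule names declared in the text."""
    return tuple(sorted(_RULE_NAME.findall(text)))


def normalized_key(text):
    """Hash after collapsing all whitespace runs (heuristic only)."""
    collapsed = " ".join(text.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


one_line = 'rule Demo_Rule { strings: $a = "abc" condition: $a }'
multi_line = 'rule Demo_Rule {\n  strings:\n    $a = "abc"\n  condition:\n    $a\n}\n'

print(byte_key(one_line) == byte_key(multi_line))
print(name_key(one_line) == name_key(multi_line))
print(normalized_key(one_line) == normalized_key(multi_line))
```

The two sample rules differ under the byte-identical definition but match under the other two, so the reported duplicate count depends directly on which definition the protocol adopts.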

Scenario Summary

The local workflow successfully reduced a broad public-rule download into a smaller Phase 1 working set. The cleanup intentionally favors correctness and reproducibility over breadth:

  • Native YARA stays native.
  • Elastic stays native.
  • Sigma is retained for controlled translation to Elastic/EQL.
  • YARA-L and other universal-conversion attempts are treated as exploratory until the protocol defines evaluator and fidelity rules.