docs: add wiki pages for all audiences, fix .old.yml leak

9 new wiki pages: getting-started, faq, troubleshooting,
advanced-usage, verification-modes, adding-a-platform,
adding-a-scraper, testing-guide, release-process.

Updated architecture.md with mermaid diagrams, tools.md with
full pipeline and target/exporter sections, profiling.md with
missing fields, index.md with glossary and nav links.

Expanded CONTRIBUTING.md from stub to full contributor guide.

Filter .old.yml from load_emulator_profiles, generate_db alias
collection, and generate_readme counts. Fix BizHawk sha1 mode
in tools.md, fix RetroPie path, fix export_truth.py typos.
This commit is contained in:
Abdessamad Derraz
2026-03-30 22:51:29 +02:00
parent 038c3d3b40
commit d4b0af0a38
20 changed files with 2742 additions and 63 deletions


@@ -1,14 +1,109 @@
# Contributing to RetroBIOS
## Types of contributions
- **Add a BIOS file** - a great way to get started. Fork, add the file, open a PR.
- **Create an emulator profile** - document what a core actually loads from source code. See the [profiling guide](https://abdess.github.io/retrobios/wiki/profiling/).
- **Add a platform** - integrate a new frontend (scraper + YAML config). See [adding a platform](https://abdess.github.io/retrobios/wiki/adding-a-platform/).
- **Add or fix a scraper** - parse upstream sources for BIOS requirements. See [adding a scraper](https://abdess.github.io/retrobios/wiki/adding-a-scraper/).
- **Fix a bug or improve tooling** - Python scripts in `scripts/`, single dependency (`pyyaml`).
## Local setup
```bash
git clone https://github.com/Abdess/retrobios.git
cd retrobios
pip install pyyaml
# run tests
python -m unittest tests.test_e2e -v
# run full pipeline (DB + verify + packs + consistency check)
python scripts/pipeline.py --offline
```
Requires Python 3.10 or later.
## Adding a BIOS file
1. Place the file in `bios/Manufacturer/Console/filename`.
2. Alternate versions (different hash, same purpose) go in `bios/Manufacturer/Console/.variants/`.
3. Files over 50 MB go as assets on the `large-files` GitHub release (git handles them better that way).
4. RPG Maker and ScummVM directories are excluded from deduplication - please keep their structure as-is.
5. Open a pull request. CI validates checksums automatically and posts a report.
## Commit conventions
Format: `type: description` (50 characters max, lowercase start).
Allowed types: `feat`, `refactor`, `chore`, `docs`, `fix`.
```
feat: add panasonic 3do bios files
docs: update architecture diagram
fix: resolve truncated md5 matching
chore: remove unused test fixtures
refactor: extract hash logic to common.py
```
Keep messages factual. No marketing language, no superfluous adjectives.
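These rules are easy to machine-check. A hypothetical validator (not part of the repository's tooling) might look like:

```python
import re

# Hypothetical helper, not part of the repo: checks a commit subject
# against the conventions above (type prefix, <=50 chars, lowercase start).
ALLOWED_TYPES = {"feat", "refactor", "chore", "docs", "fix"}

def is_valid_subject(subject: str) -> bool:
    m = re.fullmatch(r"(\w+): (.+)", subject)
    if m is None:
        return False
    kind, desc = m.groups()
    return kind in ALLOWED_TYPES and len(subject) <= 50 and not desc[0].isupper()
```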
## Code and documentation quality
The codebase runs on Python 3.10+ with a single dependency (`pyyaml`). All modules
include `from __future__ import annotations` at the top. Type hints on every function
signature, `pathlib` instead of `os.path`, and dataclasses where a plain class would
just hold attributes.
On performance: O(1) or O(n) algorithms are preferred. If something needs O(n^2), a
comment explaining why helps future readers. List comprehensions over explicit loops,
generators when iterating large datasets, and standard default arguments
(`def f(items=None)` over `def f(items=[])`).
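The default-argument rule guards against a classic Python pitfall: a mutable default is created once, at function definition, and shared across every call. A quick demonstration:

```python
def broken(items=[]):   # default list is created once and shared
    items.append(1)
    return items

def safe(items=None):   # fresh list on every call
    if items is None:
        items = []
    items.append(1)
    return items

print(broken())  # [1]
print(broken())  # [1, 1] -- state leaked from the previous call
print(safe())    # [1]
print(safe())    # [1]
```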
File I/O uses context managers. ZIP extraction goes through `safe_extract_zip()` in
`common.py`, which prevents zip-slip path traversal.
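The zip-slip check can be sketched as follows; this is an illustrative reimplementation, not the actual `safe_extract_zip()` from `common.py`:

```python
import zipfile
from pathlib import Path

def safe_extract(zip_path: Path, dest: Path) -> None:
    """Refuse archive members that would resolve outside `dest`.
    Illustrative sketch; the project's real logic lives in common.py."""
    dest = dest.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # A member like "../evil.txt" resolves outside dest -> reject
            target = (dest / member).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"blocked path traversal: {member}")
        zf.extractall(dest)
```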
The code stays lean. Comments that describe *why* age better than comments that
restate *what*. Unused variables can be deleted rather than renamed with an underscore.
The same spirit applies to documentation and emulator profiles. Straightforward
language, honest labels ("untested" when something is untested).
When a bug is spotted while working on something else, fixing it in the same PR
keeps things tidy. Features ideally ship complete in one pass: code, tests, a
passing pipeline run, and any documentation updates together.
## Pull request process
CI runs four checks on every PR:
| Check | What it does |
|-------|-------------|
| `validate-bios` | hashes changed BIOS files against the database, posts a validation report |
| `validate-configs` | schema-validates platform YAML configs |
| `run-tests` | runs the full E2E test suite |
| `label-pr` | auto-labels the PR based on changed file paths |
All four checks must pass. One maintainer approval is required before merge.
## Project philosophy
This repository is an archive. The goal is a complete collection of every file a
user needs to run every supported console in an emulator, with zero friction. That
scope goes beyond BIOS ROMs: firmware updates, system fonts, data files, encryption
keys, hiscore databases, anything the emulator loads from disk rather than generating
itself. In a hundred years the pack should still work out of the box.
Licensing considerations are secondary to completeness. If an emulator needs a
file to function, the file belongs in the collection. The project's legal position is
documented in the FAQ section of the site.
Integrity matters. Every file, hash, and metadata field is cross-checked
against the emulator's source code. Upstream references like System.dat,
`.info` files, and wiki pages are valuable and generally accurate, though
they can occasionally fall out of date. When an upstream source and the
code disagree, the code at runtime is the tiebreaker.
## Documentation
Full reference docs, profiling guides, and architecture details are on the [documentation site](https://abdess.github.io/retrobios/).


@@ -2,7 +2,7 @@
Complete BIOS and firmware packs for Batocera, BizHawk, EmuDeck, Lakka, Recalbox, RetroArch, RetroBat, RetroDECK, RetroPie, and RomM.
**7,239** verified files across **396** systems, ready to extract into your emulator's BIOS directory.
**7,241** verified files across **396** systems, ready to extract into your emulator's BIOS directory.
## Quick Install
@@ -46,7 +46,7 @@ Each file is checked against the emulator's source code to match what the code a
- **10 platforms** supported with platform-specific verification
- **329 emulators** profiled from source (RetroArch cores + standalone)
- **396 systems** covered (NES, SNES, PlayStation, Saturn, Dreamcast, ...)
- **7,239 files** verified with MD5, SHA1, CRC32 checksums
- **7,241 files** verified with MD5, SHA1, CRC32 checksums
- **8144 MB** total collection size
## Supported systems
@@ -65,7 +65,7 @@ Full list with per-file details: **[https://abdess.github.io/retrobios/](https:/
| Lakka | 442/448 (98.7%) | 442 | 0 | 6 |
| Recalbox | 277/346 (80.1%) | 274 | 3 | 69 |
| RetroArch | 442/448 (98.7%) | 442 | 0 | 6 |
| RetroBat | 337/339 (99.4%) | 333 | 4 | 2 |
| RetroBat | 339/339 (100.0%) | 335 | 4 | 0 |
| RetroDECK | 1960/2006 (97.7%) | 1934 | 26 | 46 |
| RetroPie | 442/448 (98.7%) | 442 | 0 | 6 |
| RomM | 372/374 (99.5%) | 372 | 0 | 2 |
@@ -130,4 +130,4 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
This repository provides BIOS files for personal backup and archival purposes.
*Auto-generated on 2026-03-30T12:09:51Z*
*Auto-generated on 2026-03-30T20:16:27Z*


@@ -132,7 +132,7 @@ nav:
- ZC: systems/zc.md
- Emulators:
- Overview: emulators/index.md
- Official ports (62):
- Official ports (63):
- amiarcadia: emulators/amiarcadia.md
- Amiberry: emulators/amiberry.md
- Ardens: emulators/ardens.md
@@ -180,6 +180,7 @@ nav:
- PicoDrive: emulators/picodrive.md
- play: emulators/play.md
- PPSSPP: emulators/ppsspp.md
- RPCS3: emulators/rpcs3.md
- Rustation: emulators/rustation.md
- RVVM: emulators/rvvm.md
- SameBoy: emulators/sameboy.md
@@ -429,7 +430,7 @@ nav:
- PCSX-ReARMed: emulators/pcsx_rearmed.md
- Launchers (1):
- Dolphin Launcher: emulators/dolphin_launcher.md
- Other (24):
- Other (23):
- ares: emulators/ares.md
- Beetle GBA (Mednafen): emulators/beetle_gba.md
- BigPEmu: emulators/bigpemu.md
@@ -442,7 +443,6 @@ nav:
- Model 2 Emulator: emulators/model2.md
- openMSX: emulators/openmsx.md
- Redream: emulators/redream.md
- RPCS3: emulators/rpcs3.md
- Ryujinx: emulators/ryujinx.md
- shadps4: emulators/shadps4.md
- Supermodel: emulators/supermodel.md
@@ -458,8 +458,17 @@ nav:
- Gap Analysis: gaps.md
- Wiki:
- Overview: wiki/index.md
- Getting started: wiki/getting-started.md
- FAQ: wiki/faq.md
- Troubleshooting: wiki/troubleshooting.md
- Architecture: wiki/architecture.md
- Tools: wiki/tools.md
- Profiling guide: wiki/profiling.md
- Advanced usage: wiki/advanced-usage.md
- Verification modes: wiki/verification-modes.md
- Data model: wiki/data-model.md
- Profiling guide: wiki/profiling.md
- Adding a platform: wiki/adding-a-platform.md
- Adding a scraper: wiki/adding-a-scraper.md
- Testing guide: wiki/testing-guide.md
- Release process: wiki/release-process.md
- Contributing: contributing.md


@@ -683,6 +683,8 @@ def load_emulator_profiles(
    if not emu_path.exists():
        return profiles
    for f in sorted(emu_path.glob("*.yml")):
        if f.name.endswith(".old.yml"):
            continue
        with open(f) as fh:
            profile = yaml.safe_load(fh) or {}
        if "emulator" not in profile:


@@ -401,6 +401,8 @@ def _collect_all_aliases(files: dict) -> dict:
    try:
        import yaml
        for emu_file in emulators_dir.glob("*.yml"):
            if emu_file.name.endswith(".old.yml"):
                continue
            try:
                with open(emu_file) as f:
                    emu_config = yaml.safe_load(f) or {}


@@ -91,6 +91,7 @@ def generate_readme(db: dict, platforms_dir: str) -> str:
    emulator_count = sum(
        1 for f in Path("emulators").glob("*.yml")
        if not f.name.endswith(".old.yml")
    ) if Path("emulators").exists() else 0
    # Count systems from emulator profiles
@@ -100,6 +101,8 @@ def generate_readme(db: dict, platforms_dir: str) -> str:
    try:
        import yaml
        for f in emu_dir.glob("*.yml"):
            if f.name.endswith(".old.yml"):
                continue
            with open(f) as fh:
                p = yaml.safe_load(fh) or {}
            system_ids.update(p.get("systems", []))


@@ -1358,6 +1358,8 @@ The CI automatically:
# Wiki pages
# index, architecture, tools, profiling are maintained as wiki/ sources
# and copied verbatim by main(). Only data-model is generated dynamically.
def generate_wiki_index() -> str:
    """Generate wiki landing page."""
@@ -1994,10 +1996,19 @@ def generate_mkdocs_nav(
    wiki_nav = [
        {"Overview": "wiki/index.md"},
        {"Getting started": "wiki/getting-started.md"},
        {"FAQ": "wiki/faq.md"},
        {"Troubleshooting": "wiki/troubleshooting.md"},
        {"Architecture": "wiki/architecture.md"},
        {"Tools": "wiki/tools.md"},
        {"Profiling guide": "wiki/profiling.md"},
        {"Advanced usage": "wiki/advanced-usage.md"},
        {"Verification modes": "wiki/verification-modes.md"},
        {"Data model": "wiki/data-model.md"},
        {"Profiling guide": "wiki/profiling.md"},
        {"Adding a platform": "wiki/adding-a-platform.md"},
        {"Adding a scraper": "wiki/adding-a-scraper.md"},
        {"Testing guide": "wiki/testing-guide.md"},
        {"Release process": "wiki/release-process.md"},
    ]
    return [
@@ -2180,7 +2191,7 @@ plugins:
        + 1  # cross-reference
        + 1 + len(profiles)  # emulator index + detail
        + 1  # gap analysis
        + 5  # wiki (index, architecture, tools, profiling, data model)
        + 14  # wiki pages (copied from wiki/ + generated data-model)
        + 1  # contributing
    )
    print(f"\nGenerated {total_pages} pages in {args.docs_dir}/")

wiki/adding-a-platform.md Normal file

@@ -0,0 +1,352 @@
# Adding a platform
How to add support for a new retrogaming platform (e.g. a frontend like Batocera,
a manager like EmuDeck, or a firmware database like BizHawk).
## Prerequisites
Before starting, gather the following from the upstream project:
- **Where does it define BIOS requirements?** Each platform has a canonical source:
a DAT file, a JSON fixture, an XML manifest, a Bash script, a C# database, etc.
- **What verification mode does it use?** Read the platform source code to determine
how it checks BIOS files at runtime: file existence only (`existence`), MD5 hash
matching (`md5`), SHA1 matching (`sha1`), or a combination of size and hash.
- **What is the base destination?** The directory name where BIOS files are placed
on disk (e.g. `system` for RetroArch, `bios` for Batocera, `Firmware` for BizHawk).
- **What hash type does it store?** The primary hash format used in the platform's
own data files (SHA1 for RetroArch/BizHawk, MD5 for Batocera/Recalbox/EmuDeck).
## Step 1: Create the scraper
Scrapers live in `scripts/scraper/` and are auto-discovered by the plugin system.
Any file matching `*_scraper.py` in that directory is loaded at import time via
`pkgutil.iter_modules`. No registration step is needed beyond placing the file.
### Module contract
The module must export two names:
```python
PLATFORM_NAME = "myplatform"  # matches the key in _registry.yml

class Scraper(BaseScraper):
    ...
```
### Inheriting BaseScraper
`BaseScraper` provides:
- `_fetch_raw() -> str` - HTTP GET with 50 MB response limit, cached after first call.
Uses `urllib.request` with a `retrobios-scraper/1.0` user-agent and 30s timeout.
- `compare_with_config(config) -> ChangeSet` - diffs scraped requirements against
an existing platform YAML, returning added/removed/modified entries.
- `test_connection() -> bool` - checks if the source URL is reachable.
Two abstract methods must be implemented:
```python
def fetch_requirements(self) -> list[BiosRequirement]:
    """Parse the upstream source and return one BiosRequirement per file."""

def validate_format(self, raw_data: str) -> bool:
    """Return False if the upstream format has changed unexpectedly."""
```
### BiosRequirement fields
| Field | Type | Description |
|-------|------|-------------|
| `name` | `str` | Filename as the platform expects it |
| `system` | `str` | Retrobios system ID (e.g. `sony-playstation`) |
| `sha1` | `str \| None` | SHA1 hash if available |
| `md5` | `str \| None` | MD5 hash if available |
| `crc32` | `str \| None` | CRC32 if available |
| `size` | `int \| None` | Expected file size in bytes |
| `destination` | `str` | Relative path within the BIOS directory |
| `required` | `bool` | Whether the platform considers this file mandatory |
| `zipped_file` | `str \| None` | If set, the hash refers to a ROM inside a ZIP |
| `native_id` | `str \| None` | Original system name before normalization |
### System ID mapping
Every scraper needs a mapping from the platform's native system identifiers to
retrobios system IDs. Define this as a module-level dict:
```python
SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
    ...
}
```
Warn on unmapped slugs so new systems are surfaced during scraping.
### generate_platform_yaml (optional)
If the scraper defines a `generate_platform_yaml() -> dict` method, the shared
CLI will use it instead of the generic YAML builder. This allows the scraper to
include platform metadata (homepage, version, inherits, cores list) in the output.
### CLI entry point
Add a `main()` function and `__main__` guard:
```python
def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")

if __name__ == "__main__":
    main()
```
`scraper_cli` provides `--dry-run`, `--json`, and `--output` flags automatically.
### Test the scraper
```bash
python -m scripts.scraper.myplatform_scraper --dry-run
```
This fetches from upstream and prints a summary without writing anything.
## Step 2: Register the platform
Add an entry to `platforms/_registry.yml` under the `platforms:` key.
### Required fields
```yaml
platforms:
  myplatform:
    config: myplatform.yml      # platform YAML filename in platforms/
    status: active              # active or archived
    scraper: myplatform         # matches PLATFORM_NAME in the scraper
    source_url: https://...     # upstream data URL
    source_format: json         # json, xml, clrmamepro_dat, python_dict, bash_script+csv, csharp_firmware_database, github_component_manifests
    hash_type: md5              # primary hash in the upstream data
    verification_mode: md5      # how the platform checks files: existence, md5, sha1
    base_destination: bios      # where files go on disk
    cores:                      # which emulator profiles apply
      - core_a
      - core_b
```
The `cores` field determines which emulator profiles are resolved for this platform.
Three strategies exist:
- **Explicit list**: `cores: [beetle_psx, dolphin, ...]` - match by profile key name.
Used by Batocera, Recalbox, RetroBat, RomM.
- **all_libretro**: `cores: all_libretro` - include every profile with `type: libretro`
or `type: standalone + libretro`. Used by RetroArch, Lakka, RetroPie.
- **Omitted**: fallback to system ID intersection. Used by EmuDeck.
### Optional fields
```yaml
logo: https://...                   # SVG or PNG for UI/docs
schedule: weekly                    # scrape frequency: weekly, monthly, or null
inherits_from: retroarch            # inherit systems/cores from another platform
case_insensitive_fs: true           # if the platform runs on case-insensitive filesystems
target_scraper: myplatform_targets  # hardware target scraper name
target_source: https://...          # target data source URL
install:
  detect:                           # auto-detection for install.py
    - os: linux
      method: config_file
      config: $HOME/.config/myplatform/config.ini
      parse_key: bios_directory
```
### Inheritance
If the new platform inherits from an existing one (e.g. Lakka inherits RetroArch),
set `inherits_from` in the registry AND add `inherits: retroarch` in the platform
YAML itself. `load_platform_config()` reads the `inherits:` field from the YAML to
merge parent systems and shared groups into the child. The child YAML only needs to
declare overrides.
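The merge described above can be sketched roughly like this (field names and precedence here are assumptions for illustration; the authoritative logic is `load_platform_config()` in `common.py`):

```python
# Illustrative sketch of parent/child platform merging; the real logic
# lives in load_platform_config() and may differ in detail.
def merge_platform(parent: dict, child: dict) -> dict:
    merged = dict(parent)
    for key, value in child.items():
        if key == "systems" and isinstance(value, dict):
            # Overlay per-system entries instead of replacing wholesale
            systems = dict(parent.get("systems", {}))
            systems.update(value)
            merged["systems"] = systems
        else:
            merged[key] = value
    merged.pop("inherits", None)  # resolved, no longer needed
    return merged

retroarch = {"systems": {"sony-playstation": {"files": 3}}, "base_destination": "system"}
lakka = {"inherits": "retroarch", "systems": {"sega-saturn": {"files": 2}}}
print(merge_platform(retroarch, lakka))
```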
## Step 3: Generate the platform YAML
Run the scraper with `--output` to produce the initial platform configuration:
```bash
python -m scripts.scraper.myplatform_scraper --output platforms/myplatform.yml
```
If a file already exists at the output path, the CLI preserves fields that the
scraper does not generate (e.g. `data_directories`, manually added metadata).
Only the `systems` section is replaced.
Verify the result:
```bash
python scripts/verify.py --platform myplatform
python scripts/verify.py --platform myplatform --verbose
```
## Step 4: Add verification logic
Check how the platform verifies BIOS files by reading its source code.
The `verification_mode` in the registry tells `verify.py` which strategy to use:
| Mode | Behavior | Example platforms |
|------|----------|-------------------|
| `existence` | File must exist, no hash check | RetroArch, Lakka, RetroPie |
| `md5` | MD5 must match the declared hash | Batocera, Recalbox, RetroBat, EmuDeck, RetroDECK |
| `sha1` | SHA1 must match | BizHawk |
If the platform has unique verification behavior (e.g. Batocera's `checkInsideZip`,
Recalbox's multi-hash comma-separated MD5, RomM's size + any-hash), add the logic
to `verify.py` in the platform-specific verification path.
Read the platform's source code to understand its exact verification behavior before writing any logic. Batocera's `checkInsideZip` uses `casefold()` for case-insensitive matching. Recalbox supports comma-separated MD5 lists. RomM checks file size before hashing. These details matter: the project replicates native behavior, not an approximation of it.
## Step 5: Create an exporter (optional)
Exporters convert truth data back to the platform's native format. They live in
`scripts/exporter/` and follow the same auto-discovery pattern (`*_exporter.py`).
### Module contract
The module must export an `Exporter` class inheriting `BaseExporter`:
```python
from scripts.exporter.base_exporter import BaseExporter

class Exporter(BaseExporter):
    @staticmethod
    def platform_name() -> str:
        return "myplatform"

    def export(self, truth_data: dict, output_path: str, scraped_data: dict | None = None) -> None:
        # Write truth_data in the platform's native format to output_path
        ...

    def validate(self, truth_data: dict, output_path: str) -> list[str]:
        # Return a list of issues (empty = valid)
        ...
```
`BaseExporter` provides helper methods:
- `_is_pattern(name)` - True if the filename contains wildcards or placeholders.
- `_dest(fe)` - resolve destination path from a file entry dict.
- `_display_name(sys_id, scraped_sys)` - convert a system slug to a display name.
### Round-trip validation
The exporter enables a scrape-export-compare workflow:
```bash
# Scrape upstream
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml
# Export truth data
python scripts/export_native.py --platform myplatform --output /tmp/exported.json
# Compare exported file with upstream
diff /tmp/scraped.yml /tmp/exported.json
```
## Step 6: Create a target scraper (optional)
Target scrapers determine which emulator cores are available on each hardware
target (e.g. which RetroArch cores exist for Switch, RPi4, or x86_64).
They live in `scripts/scraper/targets/` and are auto-discovered by filename
(`*_targets_scraper.py`).
### Module contract
```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"

class Scraper(BaseTargetScraper):
    def fetch_targets(self) -> dict:
        return {
            "platform": "myplatform",
            "source": "https://...",
            "scraped_at": "2026-03-30T00:00:00Z",
            "targets": {
                "x86_64": {
                    "architecture": "x86_64",
                    "cores": ["beetle_psx", "dolphin", "..."],
                },
                "rpi4": {
                    "architecture": "aarch64",
                    "cores": ["pcsx_rearmed", "mgba", "..."],
                },
            },
        }
```
Add `target_scraper` and `target_source` to the platform's registry entry.
### Overrides
Hardware-specific overrides go in `platforms/targets/_overrides.yml`. This file
defines aliases (e.g. `arm64` maps to `aarch64`) and per-platform core
additions/removals that the scraper cannot determine automatically.
### Single-target platforms
For platforms that only run on one target (e.g. RetroBat on Windows, RomM in the
browser), create a static YAML file in `platforms/targets/` instead of a scraper.
Set `target_scraper: null` in the registry.
## Step 7: Add install detection (optional)
The `install` section in `_registry.yml` tells `install.py` how to detect
the platform on the user's machine and locate its BIOS directory.
Three detection methods are available:
| Method | Description | Fields |
|--------|-------------|--------|
| `config_file` | Parse a key from a config file | `config`, `parse_key`, optionally `bios_subdir` |
| `path_exists` | Check if a directory exists | `path`, optionally `bios_path` |
| `file_exists` | Check if a file exists | `file`, optionally `bios_path` |
Each entry is scoped to an OS (`linux`, `darwin`, `windows`). Multiple entries
per OS are tried in order.
## Step 8: Validate the full pipeline
After all pieces are in place, run the full pipeline:
```bash
python scripts/pipeline.py --offline
```
This executes in sequence:
1. `generate_db.py` - rebuild `database.json` from `bios/`
2. `refresh_data_dirs.py` - update data directories
3. `verify.py --all` - verify all platforms including the new one
4. `generate_pack.py --all` - build ZIP packs
5. Consistency check - verify counts match between verify and pack
Check the output for:
- The new platform appears in verify results
- No unexpected CRITICAL or WARNING entries
- Pack generation succeeds and includes the expected files
- Consistency check passes (verify file counts match pack file counts)
Verification is not optional. A platform that passes `pipeline.py` today may break tomorrow if upstream changes its data format. Run the full pipeline on every change, even if the modification seems trivial. The consistency check (verify counts must match pack counts) catches subtle issues where files resolve during verification but fail during pack generation, or vice versa.
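The consistency check boils down to a per-platform count comparison, roughly:

```python
# Sketch of the verify-vs-pack consistency idea: file counts from the
# two stages must agree per platform. Data shapes here are illustrative,
# not pipeline.py's actual internals.
def check_consistency(verify_counts: dict[str, int], pack_counts: dict[str, int]) -> list[str]:
    issues = []
    for platform in sorted(set(verify_counts) | set(pack_counts)):
        v = verify_counts.get(platform)
        p = pack_counts.get(platform)
        if v != p:
            issues.append(f"{platform}: verified {v} files but packed {p}")
    return issues

print(check_consistency({"retroarch": 442}, {"retroarch": 442}))  # []
```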
## Checklist
- [ ] Scraper file in `scripts/scraper/<name>_scraper.py`
- [ ] `PLATFORM_NAME` and `Scraper` class exported
- [ ] `fetch_requirements()` and `validate_format()` implemented
- [ ] System ID mapping covers all upstream systems
- [ ] Entry added to `platforms/_registry.yml`
- [ ] Platform YAML generated and verified
- [ ] `python scripts/pipeline.py --offline` passes
- [ ] Exporter in `scripts/exporter/<name>_exporter.py` (if applicable)
- [ ] Target scraper in `scripts/scraper/targets/<name>_targets_scraper.py` (if applicable)
- [ ] Install detection entries in `_registry.yml` (if applicable)

wiki/adding-a-scraper.md Normal file

@@ -0,0 +1,423 @@
# Adding a scraper
How to create or modify a scraper for fetching BIOS requirements from upstream
platform sources.
## Scraper architecture
### Plugin discovery
Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py`
module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in
the scraper directory. Each module must export:
- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`
No registration code is needed. Drop a file, export the two names, and it works.
```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```
### BaseScraper ABC
`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:
```
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                       # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list              # abstract: parse upstream data
    validate_format(raw_data) -> bool         # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool                 # reachability check
```
`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0`
user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks
with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the
instance after the first call.
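The capped read can be sketched independently of HTTP. The function below takes any binary file-like object, so in `_fetch_raw()` it would be fed the `urlopen` response (with the user-agent and timeout described above); this is an illustration, not the actual code:

```python
import io

# Sketch of a size-capped chunked read, as _fetch_raw() is described
# as doing. Written against any binary file-like so it is easy to test.
def read_capped(resp, max_bytes: int = 50 * 1024 * 1024, chunk_size: int = 64 * 1024) -> bytes:
    buf = io.BytesIO()
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            return buf.getvalue()
        if buf.tell() + len(chunk) > max_bytes:
            raise ValueError(f"response exceeds {max_bytes} byte limit")
        buf.write(chunk)
```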
### BiosRequirement
A dataclass representing a single BIOS file entry:
```python
@dataclass
class BiosRequirement:
    name: str                       # filename
    system: str                     # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""           # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None  # ROM name inside a ZIP
    native_id: str | None = None    # original system name
```
### ChangeSet
Returned by `compare_with_config()`. Contains:
- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)
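A minimal re-sketch of that shape (the real class lives in `scripts/scraper/base_scraper.py` and may differ in detail, e.g. in how an empty set is summarized):

```python
from dataclasses import dataclass, field

@dataclass
class ChangeSetSketch:
    """Minimal re-sketch of ChangeSet as described above; illustrative only."""
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    modified: list = field(default_factory=list)

    @property
    def has_changes(self) -> bool:
        return bool(self.added or self.removed or self.modified)

    def summary(self) -> str:
        parts = []
        if self.added:
            parts.append(f"+{len(self.added)} added")
        if self.removed:
            parts.append(f"-{len(self.removed)} removed")
        if self.modified:
            parts.append(f"~{len(self.modified)} modified")
        return ", ".join(parts) or "no changes"

print(ChangeSetSketch(added=["scph5500.bin"]).summary())  # +1 added
```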
### scraper_cli
`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:
| Flag | Behavior |
|------|----------|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |
When `--output` targets an existing file, the CLI preserves keys not generated by
the scraper (e.g. `data_directories`, manual additions). Only the `systems` section
is replaced. If the scraper defines `generate_platform_yaml()`, that method is used
instead of the generic YAML builder.
### Helper functions
Two additional functions in `base_scraper.py`:
- `fetch_github_latest_version(repo)` - fetches the latest release tag via GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching
an optional prefix.
## Creating a BIOS scraper
### Minimal example
Based on the RomM scraper pattern (JSON source, flat structure):
```python
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"
SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}

class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")
        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0

def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")

if __name__ == "__main__":
    main()
```
### Parsing different upstream formats
Each platform stores its BIOS requirements differently. The scraper's job is to
normalize them into `BiosRequirement` entries.
| Format | Example | Parsing approach |
|--------|---------|-----------------|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |
### System ID mapping
Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's
native system identifiers to retrobios system IDs. The retrobios system ID format
is `manufacturer-console` in lowercase with hyphens (e.g. `sony-playstation`,
`sega-mega-drive`, `nintendo-gba`).
When a native slug has no mapping, print a warning to stderr. This surfaces new
systems added upstream that need to be mapped.
System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in `common.py` maps common variations to canonical IDs.
### Hash normalization
- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields
are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution
layer handles prefix matching, but the scraper should store the hash as-is.
Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in `_shared.yml` (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.
## Creating a target scraper
Target scrapers determine which emulator cores are available on each hardware
target. They live in `scripts/scraper/targets/` and follow the same auto-discovery
pattern (`*_targets_scraper.py`).
### BaseTargetScraper ABC
```
BaseTargetScraper
__init__(url: str)
fetch_targets() -> dict # abstract: return target data
write_output(data, path) # write YAML to disk
```
### Output format
`fetch_targets()` must return a dict with this structure:
```python
{
"platform": "myplatform",
"source": "https://...",
"scraped_at": "2026-03-30T12:00:00Z",
"targets": {
"x86_64": {
"architecture": "x86_64",
"cores": ["beetle_psx", "dolphin", "snes9x"],
},
"rpi4": {
"architecture": "aarch64",
"cores": ["pcsx_rearmed", "mgba"],
},
},
}
```
The `targets` dict maps target names to their available cores. Core names must
match the names used in emulator profile `cores:` fields for the target filtering
pipeline to work correctly.
### Overrides
`platforms/targets/_overrides.yml` provides post-scrape adjustments:
- **aliases**: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- **add_cores**: cores present on a target but not detected by the scraper
- **remove_cores**: cores detected by the scraper but not actually functional
Overrides are applied by `load_target_config()` in `common.py` after loading
the scraped data. The scraper itself does not need to handle overrides.
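An illustrative sketch of the three override mechanisms; only the mechanism names (aliases, add_cores, remove_cores) come from the text — the exact key layout of `_overrides.yml` is an assumption here:

```yaml
# Illustrative only - the real _overrides.yml layout may differ.
aliases:
  arm64: aarch64      # alternate name -> canonical target
  rpi4: bcm2711
targets:
  bcm2711:
    add_cores: [mgba]         # works on this target but missed by the scraper
    remove_cores: [dolphin]   # scraped but not actually functional
```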
### Module contract
```python
from scripts.scraper.targets import BaseTargetScraper
PLATFORM_NAME = "myplatform_targets"
class Scraper(BaseTargetScraper):
def __init__(self):
super().__init__(url="https://...")
def fetch_targets(self) -> dict:
# Fetch and parse target data
...
```
Register the target scraper in `_registry.yml`:
```yaml
myplatform:
target_scraper: myplatform_targets
target_source: https://...
```
### Existing target scrapers
| Scraper | Source | Approach |
|---------|--------|----------|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |
## Parser modules
Shared parsers in `scripts/scraper/` handle formats used by multiple scrapers
or formats complex enough to warrant dedicated parsing logic.
### dat_parser
Parses clrmamepro DAT format as used in RetroArch's `System.dat`:
```
game (
name "System"
comment "Platform Name"
rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```
Produces `DatRom` dataclass instances with `name`, `size`, `crc32`, `md5`, `sha1`,
and `system` fields. The `libretro_scraper` uses this parser.
### mame_parser
Parses MAME C source files to extract BIOS root sets. Handles:
- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging
Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources
for each tagged MAME version.
### fbneo_parser
Parses FBNeo C source files:
- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag
Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.
### _hash_merge
Text-based YAML patching that merges fetched hash data into emulator profiles
while preserving formatting. Two strategies:
- **MAME**: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- **FBNeo**: updates individual ROM entries grouped by `archive` field
The merge preserves fields the hash data does not generate (system, note, required)
and leaves entries not present in the hash data untouched. Uses text-level YAML
manipulation rather than load-dump to maintain human-readable formatting.
## Testing
### Development workflow
1. **Start with `--dry-run`** to preview results before writing any output:
```bash
python -m scripts.scraper.myplatform_scraper --dry-run
```
2. **Check JSON output** for data quality:
```bash
python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
```
3. **Compare with existing YAML** if updating a scraper:
```bash
python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
diff platforms/myplatform.yml /tmp/test.yml
```
4. **Run verification** after generating:
```bash
python scripts/verify.py --platform myplatform
python scripts/verify.py --platform myplatform --verbose
```
5. **Run the full pipeline** before committing:
```bash
python scripts/pipeline.py --offline
```
### Round-trip testing
If an exporter exists for the platform, validate the scrape-export-compare cycle:
```bash
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml
# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json
# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
/tmp/exported.json
```
### Common issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | `_fetch_raw()` has 30s timeout; check URL |
| `Response exceeds 50 MB` | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |
### E2E tests
The project's test suite (`tests/test_e2e.py`) covers scraper integration at the
pipeline level. When adding a new scraper, verify that the full pipeline passes:
```bash
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```
Both must pass before the scraper is considered complete.

wiki/advanced-usage.md
# Advanced Usage
Fine-grained control over pack generation, hardware filtering, truth analysis, and verification.
## Custom Packs
### Build from hash
Look up a single MD5 in the database:
```bash
python scripts/generate_pack.py --from-md5 d8f1206299c48946e6ec5ef96d014eaa
```
Build a pack containing only files matching hashes from a list (one MD5 per line, `#` for comments):
```bash
python scripts/generate_pack.py --platform batocera --from-md5-file missing.txt
```
This is useful when a platform reports missing files and you want to generate a targeted pack
rather than re-downloading the full archive.
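Parsing the list-file format takes only a few lines; this helper is a hypothetical sketch of the accepted syntax (one MD5 per line, `#` comments, blank lines ignored — whether trailing inline comments are accepted is an assumption):

```python
def read_md5_list(text: str) -> list[str]:
    """Parse a --from-md5-file payload into lowercase MD5 strings."""
    hashes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hashes.append(line.lower())
    return hashes
```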
### Split packs
Generate one ZIP per system instead of a single monolithic pack:
```bash
python scripts/generate_pack.py --platform retroarch --split
```
Group the split ZIPs by manufacturer (Sony, Nintendo, Sega, etc.):
```bash
python scripts/generate_pack.py --platform retroarch --split --group-by manufacturer
```
### System-specific packs
Extract only the files for a single system within a platform:
```bash
python scripts/generate_pack.py --platform retroarch --system sony-playstation
```
### Required only
Exclude optional files from the pack:
```bash
python scripts/generate_pack.py --platform batocera --required-only
```
What counts as "required" depends on the platform YAML. For existence-mode platforms
(RetroArch), the distinction comes from the `.info` file's `required` field.
For MD5-mode platforms (Batocera), all declared files are treated as required unless
explicitly marked optional.
## Hardware Target Filtering
### What targets are
A target represents a hardware architecture where a platform runs. Each architecture
has a different set of available cores. For example, the RetroArch Switch target
has fewer cores than the x86_64 target because some cores are not ported to ARM.
Target data is scraped from upstream sources (buildbot nightly listings, board configs,
scriptmodules) and stored in `platforms/targets/<platform>.yml`.
### Usage
Filter packs or verification to only include systems reachable by cores available
on the target hardware:
```bash
python scripts/generate_pack.py --platform retroarch --target switch
python scripts/generate_pack.py --all --target x86_64
python scripts/verify.py --platform batocera --target rpi4
```
When combined with `--all`, platforms that define the target are filtered. Platforms
without a target file for that name are left unfiltered (no information to exclude anything).
Platforms that have target data but not the requested target are skipped with an INFO message.
### How it works
The filtering pipeline has three stages:
1. **`load_target_config()`** reads `platforms/targets/<platform>.yml` and returns
the set of cores available on the target. Aliases from `_overrides.yml` are resolved
(e.g., `--target rpi4` may match `bcm2711` in the target file).
2. **`resolve_platform_cores()`** determines which emulator profiles are relevant
for the platform, then intersects the result with the target's core set. The
intersection uses a reverse index built from each profile's `cores:` field, so
that upstream names (e.g., `mednafen_psx` on the buildbot) map to profile keys
(e.g., `beetle_psx`).
3. **`filter_systems_by_target()`** removes platform systems where every core that
emulates them is absent from the target. Systems with no core information are kept
(benefit of the doubt). System ID normalization strips manufacturer prefixes and
separators so that `xbox` matches `microsoft-xbox`.
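Stages 2 and 3 can be sketched as follows; `PROFILES`, the manufacturer list, and both helper names are illustrative, not the real `common.py` implementation:

```python
# Stage 2: reverse index from upstream core names to profile keys.
PROFILES = {  # profile key -> upstream names from its `cores:` field
    "beetle_psx": ["mednafen_psx", "beetle_psx"],
    "mgba": ["mgba"],
}

def reverse_core_index(profiles: dict[str, list[str]]) -> dict[str, str]:
    """Map each upstream core name back to the profile key that declares it."""
    return {name: key for key, names in profiles.items() for name in names}

# Stage 3: system ID normalization for matching.
def normalize_system_id(system_id: str) -> str:
    """Strip manufacturer prefix and separators so 'xbox' matches 'microsoft-xbox'."""
    s = system_id.lower().replace("_", "-")
    for maker in ("microsoft", "sony", "nintendo", "sega"):  # illustrative list
        prefix = maker + "-"
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    return s.replace("-", "")
```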
### List available targets
```bash
python scripts/verify.py --platform retroarch --list-targets
```
### Overrides
`platforms/targets/_overrides.yml` provides two mechanisms:
- **Aliases**: map user-facing names to internal target IDs
(e.g., `rpi4` -> `bcm2711`).
- **add/remove cores**: patch the scraped core list for a specific target
without overwriting the entire file. Useful when a core is known to work
but is not listed on the buildbot, or vice versa.
### Single-target platforms
Platforms with only one target (e.g., RetroBat with `windows`, RomM with `browser`)
treat `--target <their-only-target>` as a no-op: the output is identical to running
without `--target`.
## Truth Generation and Diffing
### What truth is
Truth data is ground truth generated from emulator profiles. It represents what each
core actually needs based on source code analysis, independent of what platform
scrapers declare. The purpose is gap analysis: finding files that platforms miss
or declare incorrectly.
### Generate truth
Build truth YAMLs from emulator profiles for a platform or all platforms:
```bash
python scripts/generate_truth.py --platform retroarch
python scripts/generate_truth.py --all --output-dir dist/truth/
```
Each truth YAML lists every system with its files, hashes, and the emulator profiles
that reference them. The output mirrors the platform YAML structure so the two can
be diffed directly.
### Diff truth vs scraped
Find divergences between generated truth and scraped platform data:
```bash
python scripts/diff_truth.py --platform retroarch
python scripts/diff_truth.py --all
```
The diff reports:
- Files present in truth but absent from the platform YAML (undeclared).
- Files present in the platform YAML but absent from truth (orphaned or from cores
not profiled yet).
- Hash mismatches between truth and platform data.
### Export to native formats
Convert truth data to the native format each platform consumes:
```bash
python scripts/export_native.py --platform batocera # Python dict (batocera-systems)
python scripts/export_native.py --platform recalbox # XML (es_bios.xml)
python scripts/export_native.py --all --output-dir dist/upstream/
```
This allows submitting corrections upstream in the format maintainers expect.
## Emulator-Level Verification
### Per-emulator checks
Verify files against a single emulator's ground truth (size, hashes, crypto):
```bash
python scripts/verify.py --emulator handy
python scripts/verify.py --emulator handy --verbose
```
Default output shows aggregate results per file: the core name and which checks apply.
With `--verbose`, each file expands to one line per core with the exact validation
parameters and source code reference:
```
lynxboot.img
handy validates size=512 crc32=0x0d973c9d [src/handy/system.h:45]
```
### Per-system checks
Aggregate verification across all cores that emulate a system:
```bash
python scripts/verify.py --system atari-lynx
```
### Standalone mode
Some cores have both libretro and standalone modes with different file requirements.
Filter to standalone-only:
```bash
python scripts/verify.py --emulator dolphin --standalone
```
### Ground truth in verbose output
The verbose report includes a coverage footer:
```
Ground truth: 142/160 files have emulator validation (88%)
```
This indicates how many files in the platform can be cross-checked against source-verified
emulator profiles. Files without ground truth rely solely on platform-level verification.
JSON output (`--json`) always includes the full per-emulator detail regardless of verbosity.
## Offline Workflow
### Full offline pipeline
Run the entire pipeline without network access:
```bash
python scripts/pipeline.py --offline
```
This skips data directory refresh, MAME/FBNeo hash fetch, and buildbot staleness checks.
All other steps (database generation, verification, pack building, consistency check,
README, site generation) run normally using cached data.
### Partial runs
Skip pack generation when you only need verification results:
```bash
python scripts/pipeline.py --offline --skip-packs
```
Skip documentation generation:
```bash
python scripts/pipeline.py --offline --skip-docs
```
### Truth pipeline
Include truth generation and diffing in the pipeline:
```bash
python scripts/pipeline.py --offline --with-truth
```
Include truth + native format export:
```bash
python scripts/pipeline.py --offline --with-export
```
### Combining flags
Flags compose freely:
```bash
python scripts/pipeline.py --offline --skip-docs --with-truth --target switch
```
This runs: database generation, verification (filtered to Switch cores), truth generation
and diff, consistency check. Packs and docs are skipped, no network access.

@@ -6,16 +6,22 @@
bios/ BIOS and firmware files, organized by Manufacturer/Console/
Manufacturer/Console/ canonical files (one per unique content)
.variants/ alternate versions (different hash, same purpose)
emulators/ one YAML profile per core/engine
platforms/ one YAML config per platform (scraped from upstream)
_shared.yml shared file groups across platforms
_registry.yml platform metadata (logos, scrapers, status, install config)
_data_dirs.yml data directory definitions (Dolphin Sys, PPSSPP...)
targets/ hardware target configs + _overrides.yml
scripts/ all tooling (Python, pyyaml only dependency)
scraper/ upstream scrapers (libretro, batocera, recalbox...)
scraper/targets/ hardware target scrapers (retroarch, batocera, emudeck, retropie)
exporter/ native format exporters (batocera, recalbox, emudeck...)
install/ JSON install manifests per platform
targets/ JSON target manifests per platform (cores per architecture)
data/ cached data directories (not BIOS, fetched at build)
schemas/ JSON schemas for validation
tests/ E2E test suite with synthetic fixtures
_mame_clones.json MAME parent/clone set mappings
dist/ generated packs (gitignored)
.cache/ hash cache and large file downloads (gitignored)
```
@@ -28,11 +34,38 @@ Upstream sources Scrapers parse generate_db.py scans
batocera-systems builds database.json
es_bios.xml (recalbox) (SHA1 primary key,
core-info .info files indexes: by_md5, by_name,
FirmwareDatabase.cs by_crc32, by_path_suffix)
MAME/FBNeo source
emulators/*.yml verify.py checks generate_pack.py resolves
source-verified platform-native files by hash, builds ZIP
from code verification packs per platform
truth.py generates diff_truth.py export_native.py
ground truth from compares truth vs exports to native formats
emulator profiles scraped platform (DAT, XML, JSON, Bash)
```
Pipeline runs all steps in sequence: DB, data dirs, MAME/FBNeo hashes,
verify, packs, install manifests, target manifests, consistency check,
README, site. See [tools](tools.md) for the full pipeline reference.
```mermaid
graph LR
A[generate_db] --> B[refresh_data_dirs]
B --> C[MAME/FBNeo hashes]
C --> D[verify --all]
D --> E[generate_pack --all]
E --> F[install manifests]
F --> G[target manifests]
G --> H[consistency check]
H --> I[generate_readme]
I --> J[generate_site]
style A fill:#2d333b,stroke:#adbac7,color:#adbac7
style D fill:#2d333b,stroke:#adbac7,color:#adbac7
style E fill:#2d333b,stroke:#adbac7,color:#adbac7
style J fill:#2d333b,stroke:#adbac7,color:#adbac7
```
## Three layers of data
@@ -46,12 +79,39 @@ emulators/*.yml verify.py checks generate_pack.py resolves
The pack combines platform baseline (layer 1) with core requirements (layer 3).
Neither too much (no files from unused cores) nor too few (no missing files for active cores).
The emulator's source code serves as ground truth for what files are needed,
what names they use, and what validation the emulator performs. Platform YAML
configs are scraped from upstream and are generally accurate, though they can
occasionally have gaps or stale entries. The emulator profiles complement the
platform data by documenting what the code actually loads. When the two disagree,
the profile takes precedence for pack generation: files the code needs are included
even if the platform does not declare them. Files the platform declares but no
profile references are kept as well (flagged during cross-reference), since the
upstream may cover cases not yet profiled.
```mermaid
graph TD
PY[Platform YAML<br/>scraped from upstream] --> PG[Pack generation]
EP[Emulator profiles<br/>source-verified] --> PG
SH[_shared.yml<br/>curated shared files] --> PY
SH --> EP
PG --> ZIP[ZIP pack per platform]
style PY fill:#2d333b,stroke:#adbac7,color:#adbac7
style EP fill:#2d333b,stroke:#adbac7,color:#adbac7
style SH fill:#2d333b,stroke:#adbac7,color:#adbac7
style PG fill:#2d333b,stroke:#adbac7,color:#adbac7
style ZIP fill:#2d333b,stroke:#adbac7,color:#adbac7
```
## Pack grouping
Platforms that produce identical packs are grouped automatically.
RetroArch and Lakka share the same files and `base_destination` (`system/`),
so they produce one combined pack (`RetroArch_Lakka_BIOS_Pack.zip`).
RetroPie uses `BIOS/` as base path, so it gets a separate pack.
With `--target`, the fingerprint includes target cores so platforms
with different hardware filters get separate packs.
## Storage tiers
@@ -99,6 +159,46 @@ If none exists, the platform version is kept.
| RPG Maker/ScummVM | excluded from dedup (NODEDUP) to preserve directory structure |
| `strip_components` in data dirs | flattens cache prefix to match expected path |
| case-insensitive dedup | prevents `font.rom` + `FONT.ROM` conflicts on Windows/macOS |
| frozen snapshot cores | `.info` may reflect current version while code is pinned to an old one. Only the frozen source at the pinned tag is reliable (e.g. desmume2015, mame2003) |
### File resolution chain
`resolve_local_file` in `common.py` tries each strategy in order, returning the
first match. Used by both `verify.py` and `generate_pack.py`.
```mermaid
graph TD
START([resolve_local_file]) --> S0{path_suffix<br/>exact match?}
S0 -- yes --> EXACT([exact])
S0 -- no --> S1{SHA1<br/>exact match?}
S1 -- yes --> EXACT
S1 -- no --> S2{MD5 direct<br/>or truncated?}
S2 -- yes --> MD5([md5_exact])
S2 -- no --> S3{name + aliases<br/>no MD5?}
S3 -- yes --> EXACT
S3 -- no --> S4{name + aliases<br/>md5_composite /<br/>direct MD5?}
S4 -- match --> EXACT
S4 -- name only --> HM([hash_mismatch])
S4 -- no --> S5{zippedFile<br/>inner ROM MD5?}
S5 -- yes --> ZE([zip_exact])
S5 -- no --> S6{MAME clone<br/>map lookup?}
S6 -- yes --> MC([mame_clone])
S6 -- no --> S7{data_dir<br/>cache scan?}
S7 -- yes --> DD([data_dir])
S7 -- no --> S8{agnostic<br/>fallback?}
S8 -- yes --> AG([agnostic_fallback])
S8 -- no --> NF([not_found])
style START fill:#2d333b,stroke:#adbac7,color:#adbac7
style EXACT fill:#2d333b,stroke:#adbac7,color:#adbac7
style MD5 fill:#2d333b,stroke:#adbac7,color:#adbac7
style HM fill:#2d333b,stroke:#adbac7,color:#adbac7
style ZE fill:#2d333b,stroke:#adbac7,color:#adbac7
style MC fill:#2d333b,stroke:#adbac7,color:#adbac7
style DD fill:#2d333b,stroke:#adbac7,color:#adbac7
style AG fill:#2d333b,stroke:#adbac7,color:#adbac7
style NF fill:#2d333b,stroke:#adbac7,color:#adbac7
```
## Platform inheritance
@@ -112,17 +212,36 @@ Core resolution (`resolve_platform_cores`) uses three strategies:
- `cores: [list]` - include only named profiles
- `cores:` absent - fallback to system ID intersection between platform and profiles
## Hardware target filtering
`--target TARGET` filters packs and verification by hardware (e.g. `switch`, `rpi4`, `x86_64`).
Target configs are in `platforms/targets/`. Overrides in `_overrides.yml` add aliases and
adjust core lists per target. `filter_systems_by_target` excludes systems whose cores are
not available on the target. Without `--target`, all systems are included.
## MAME clone map
`_mame_clones.json` at repo root maps MAME clone ROM names to their canonical parent.
When a clone ZIP was deduplicated, `resolve_local_file` uses this map to find the canonical file.
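Assuming a flat clone-to-parent JSON mapping (the actual `_mame_clones.json` schema may differ), the lookup is a sketch like:

```python
import json

# Hypothetical excerpt of _mame_clones.json: clone ROM name -> canonical parent.
CLONES_JSON = '{"mslug3a": "mslug3", "sf2ce": "sf2"}'

def canonical_rom_name(clone_name: str) -> str:
    """Resolve a clone ZIP name to the canonical parent kept on disk."""
    clone_map = json.loads(CLONES_JSON)
    return clone_map.get(clone_name, clone_name)  # parents map to themselves
```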
## Install manifests
`generate_pack.py --manifest` produces JSON manifests in `install/` for each platform.
These contain file lists with SHA1 hashes, platform detection config, and standalone copy
instructions. `install/targets/` contains per-architecture core availability.
The cross-platform installer (`install.py`) uses these manifests to auto-detect the
user's platform, filter files by hardware target, and download with SHA1 verification.
## Tests
4 test files with synthetic fixtures:
| File | Coverage |
|------|----------|
| `test_e2e.py` | file resolution, verification, severity, cross-reference, aliases, inheritance, shared groups, data dirs, storage tiers, HLE, launchers, platform grouping, core resolution, target filtering, truth/diff, exporters |
| `test_mame_parser.py` | BIOS root set detection, ROM block parsing, macro expansion |
| `test_fbneo_parser.py` | BIOS set detection, ROM info parsing |
| `test_hash_merge.py` | MAME/FBNeo YAML merge, diff detection |
```bash
python -m unittest tests.test_e2e -v
```
@@ -132,7 +251,8 @@ python -m unittest tests.test_e2e -v
| Workflow | File | Trigger | Role |
|----------|------|---------|------|
| Build & Release | `build.yml` | `workflow_dispatch` (manual) | restore large files, build packs, create GitHub release |
| Deploy Site | `deploy-site.yml` | push to main (platforms, emulators, wiki, scripts) + manual | generate site, build with MkDocs, deploy to GitHub Pages |
| PR Validation | `validate.yml` | pull request on `bios/`/`platforms/` | validate BIOS hashes, schema check, run tests, auto-label PR |
| Weekly Sync | `watch.yml` | cron (Monday 6 AM UTC) + manual | scrape upstream sources, detect changes, create update PR |

wiki/faq.md
# FAQ - RetroBIOS
## My game shows a black screen
Most likely a missing or incorrect BIOS file. Run verification for your platform:
```bash
python scripts/verify.py --platform retroarch
```
Look for MISSING or HASH MISMATCH entries. If a file shows HASH MISMATCH, you have a BIOS file but it's the wrong version or a bad dump. Replace it with one that matches the expected hash.
Some cores also support HLE (see below), so a missing BIOS may not always be the cause. Check the emulator's logs for error messages.
## What's the difference between required and optional?
**Required** means the emulator will not start games for that system without the file. **Optional** means the emulator works without it, but with reduced accuracy or missing features (e.g., boot screen animation, wrong font rendering, or degraded audio).
In verification output, missing required files appear as CRITICAL or WARNING depending on the platform. Missing optional files appear as WARNING or INFO.
## What's HLE?
HLE (High-Level Emulation) is a software reimplementation of what the original BIOS does. Some cores can boot games without a real BIOS file by using their built-in HLE fallback. The trade-off is lower accuracy: some games may have glitches or fail to boot entirely.
When a core has HLE support, the verification tool lowers the severity of a missing BIOS to INFO. The file is still included in packs because the real BIOS gives better results.
## Why are there multiple hashes for the same file?
Two main reasons:
1. **Regional variants.** The same filename (e.g., `IPL.bin` for GameCube) exists in different versions for USA, Europe, and Japan. Each region has a different hash.
2. **Revision differences.** Console manufacturers released updated BIOS versions over time. A PlayStation SCPH-5501 BIOS differs from a SCPH-7001.
Platforms that verify by MD5 accept specific hashes. If yours doesn't match any known hash, it may be a bad dump or an uncommon revision.
## How do I know which BIOS I need?
Two approaches:
1. **Run verify.py** for your platform. It lists every expected file with its hash and status.
2. **Check the project site.** Each platform page lists all required and optional BIOS files per system.
For a specific emulator core:
```bash
python scripts/verify.py --emulator beetle_psx --verbose
```
The `--verbose` flag shows source references and expected values from the emulator's source code.
## Is this legal?
Yes. Distribution of BIOS files, firmware, and encryption keys for emulation and preservation is supported by established case law and statutory exemptions across multiple jurisdictions.
### Emulation and BIOS redistribution
- **Emulation is legal.** *Sony v. Connectix* (2000) and *Sega v. Accolade* (1992) established that creating emulators and reverse-engineering console firmware for interoperability is lawful. BIOS files are functional prerequisites for this legal activity.
- **Fair use (US, 17 USC 107).** Non-commercial redistribution of firmware for personal emulation and archival is transformative use. The files serve a different purpose (interoperability) than the original (running proprietary hardware). No commercial market exists for standalone BIOS files.
- **Fair dealing (EU, UK, Canada, Australia).** Equivalent doctrines protect research, private study, and interoperability. The EU Software Directive (2009/24/EC, Art. 5-6) explicitly permits decompilation and use for interoperability.
- **Abandonware.** The vast majority of firmware here is for discontinued hardware no longer sold, supported, or distributed by the original manufacturer. No active commercial market is harmed.
### Encryption keys (Switch prod.keys, 3DS AES keys, Wii U keys)
This is the most contested area. The legal position:
- **Keys are not copyrightable.** Encryption keys are mathematical values, not creative expression. Copyright protects original works of authorship; a 256-bit number does not meet the threshold of originality. *Bernstein v. DOJ* (1996) established that code and algorithms are protected speech, and the mere publication of numeric values cannot be restricted under copyright.
- **DMCA 1201(f) interoperability exemption.** The DMCA prohibits circumvention of technological protection measures, but Section 1201(f) explicitly permits circumvention for the purpose of achieving interoperability between programs. Emulators require these keys to decrypt and run legally purchased game software. The keys enable interoperability, not piracy.
- **Library of Congress DMCA exemptions.** The triennial rulemaking process has repeatedly expanded exemptions for video game preservation. The 2024 exemption (37 CFR 201.40) covers circumvention for preservation of software and video games, including when the original hardware is no longer available.
- **Keys derived from consumer hardware.** These keys are extracted from retail hardware owned by consumers. Once a product is sold, the manufacturer cannot indefinitely control how the purchaser uses or examines their own property. *Chamberlain v. Skylink* (2004) held that using a product in a way the manufacturer dislikes is not automatically a DMCA violation.
- **No trade secret protection.** For keys to qualify as trade secrets, the holder must take reasonable steps to maintain secrecy. Keys embedded in millions of consumer devices and widely published online do not meet this standard.
### Recent firmware (Switch 19.0.0, PS3UPDAT, PSVUPDAT)
- **Firmware updates are freely distributed.** Nintendo, Sony, and other manufacturers distribute firmware updates via CDN without authentication or purchase requirements. Redistributing freely available data does not create new legal liability.
- **Functional necessity.** Emulators require system firmware to function. Providing firmware is equivalent to providing the operating environment the software was designed to run in.
- **Yuzu context.** The Yuzu settlement (2024) concerned the emulator itself and its facilitation of piracy, not the legality of firmware or key distribution. Yuzu settled without admitting liability and the case created no binding precedent against BIOS or key redistribution.
### Summary
This project distributes BIOS files, firmware, and encryption keys for personal use, archival, and interoperability with emulation software. The legal basis rests on fair use, statutory interoperability exemptions, preservation precedent, and the non-copyrightable nature of encryption keys.
## What's a hash/checksum?
A hash is a fixed-length fingerprint computed from a file's contents. If even one byte differs, the hash changes completely. The project uses three types:
| Type | Length | Example |
|------|--------|---------|
| MD5 | 32 hex chars | `924e392ed05558ffdb115408c263dccf` |
| SHA1 | 40 hex chars | `10155d8d6e6e832d8ea1571511e40dfb15fede05` |
| CRC32 | 8 hex chars | `2F468B96` |
Different platforms use different hash types for verification. Batocera uses MD5, RetroArch checks existence only, and RomM accepts any of the three.
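Computing all three fingerprints for a file needs only the standard library; this helper is a sketch, not part of the project's tooling:

```python
import hashlib
import zlib

def file_fingerprints(data: bytes) -> dict[str, str]:
    """Compute the three hash types used by the project for one file's bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        # CRC32 rendered as 8 uppercase hex chars, matching the table above.
        "crc32": f"{zlib.crc32(data) & 0xFFFFFFFF:08X}",
    }
```

In practice you would read the file in chunks (`hashlib` objects support incremental `update()`) rather than loading it whole.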
## Why does my verification report say UNTESTED?
UNTESTED means the file exists on disk but its hash was not confirmed against a known value. This happens on existence-mode platforms (RetroArch, Lakka, RetroPie) where the platform only checks that the file is present, without verifying its contents.
The file may still be correct. Running `verify.py --emulator <core> --verbose` shows the emulator-level ground truth, which can confirm whether the file's hash matches what the source code expects.
## Can I use BIOS from one platform on another?
Yes. BIOS files are console-specific, not platform-specific. A PlayStation BIOS works in RetroArch, Batocera, Recalbox, and any other platform that emulates PlayStation. The only differences between platforms are:
- **Where the file goes** (each platform has its own BIOS directory)
- **What filename is expected** (usually the same, occasionally different)
- **How verification works** (MD5 check vs. existence check)
The packs differ per platform because each platform declares its own set of supported systems and expected files.
## How often are packs updated?
A weekly automated sync checks upstream sources (libretro System.dat, batocera-systems, etc.) for changes. If differences are found, a pull request is created automatically. Manual releases happen as needed when new BIOS files are added or profiles are updated.

# Getting started - RetroBIOS
## What are BIOS files?
BIOS files are firmware dumps from original console hardware. Emulators need them to boot games for systems that relied on built-in software (PlayStation, Saturn, Dreamcast, etc.). Without the correct BIOS, the emulator either refuses to start the game or falls back to less accurate software emulation.
## Installation
Three ways to get BIOS files in place, from easiest to most manual.
### Option 1: install.py (recommended)
Self-contained Python script, no dependencies beyond Python 3.10+. Auto-detects your platform and BIOS directory.
```bash
python install.py
```
Override detection if needed:
```bash
python install.py --platform retroarch --dest ~/custom/bios
python install.py --check # verify existing files without downloading
python install.py --list-platforms # show supported platforms
```
The installer downloads files from GitHub releases, verifies SHA1 checksums, and places them in the correct directory.
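The download-verify-place flow can be sketched as follows. This is a hypothetical helper, not `install.py`'s actual code; `install_file`, `url`, and `expected_sha1` are illustrative names:

```python
import hashlib
import urllib.request
from pathlib import Path

def sha1_of(path: Path) -> str:
    """Stream the file so large BIOS images don't load fully into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def install_file(url: str, dest: Path, expected_sha1: str) -> None:
    tmp = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, tmp)   # download to a temp name
    if sha1_of(tmp) != expected_sha1:      # verify before placing
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {dest.name}")
    tmp.rename(dest)                       # only verified files land at dest
```

Verifying before renaming means a failed download never leaves a corrupt file at the destination path.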
### Option 2: download.sh (Linux/macOS)
One-liner for systems with `curl` or `wget`:
```bash
bash scripts/download.sh retroarch ~/RetroArch/system/
bash scripts/download.sh --list # show available packs
```
### Option 3: manual download
1. Go to the [releases page](https://github.com/Abdess/retrobios/releases)
2. Download the ZIP pack for your platform
3. Extract to the BIOS directory listed below
## BIOS directory by platform
### RetroArch
RetroArch uses the `system_directory` setting in `retroarch.cfg`. Default locations:
| OS | Default path |
|----|-------------|
| Windows | `%APPDATA%\RetroArch\system\` |
| Linux | `~/.config/retroarch/system/` |
| Linux (Flatpak) | `~/.var/app/org.libretro.RetroArch/config/retroarch/system/` |
| macOS | `~/Library/Application Support/RetroArch/system/` |
| Steam Deck | `~/.var/app/org.libretro.RetroArch/config/retroarch/system/` |
| Android | `/storage/emulated/0/RetroArch/system/` |
To check your actual path: open RetroArch, go to **Settings > Directory > System/BIOS**, or look for `system_directory` in `retroarch.cfg`.
### Batocera
```
/userdata/bios/
```
Accessible via network share at `\\BATOCERA\share\bios\` (Windows) or `smb://batocera/share/bios/` (macOS/Linux).
### Recalbox
```
/recalbox/share/bios/
```
Accessible via network share at `\\RECALBOX\share\bios\`.
### RetroBat
```
bios/
```
Relative to the RetroBat installation directory (e.g., `C:\RetroBat\bios\`).
### RetroDECK
```
~/.var/app/net.retrodeck.retrodeck/retrodeck/bios/
```
### EmuDeck
```
Emulation/bios/
```
Located inside your Emulation folder. On Steam Deck, typically `~/Emulation/bios/`.
### Lakka
```
/storage/system/
```
Accessible via SSH or Samba.
### RetroPie
```
~/RetroPie/BIOS/
```
### BizHawk
```
Firmware/
```
Relative to the BizHawk installation directory.
### RomM
BIOS files are managed through the RomM web interface. Check the
[RomM documentation](https://github.com/rommapp/romm) for setup details.
## Verifying your setup
After placing BIOS files, verify that everything is correct:
```bash
python scripts/verify.py --platform retroarch
python scripts/verify.py --platform batocera
python scripts/verify.py --platform recalbox
```
The output shows each expected file with its status: OK, MISSING, or HASH MISMATCH. Platforms that verify by MD5 (Batocera, Recalbox, EmuDeck) will catch wrong versions. RetroArch only checks that files exist.
For a single system:
```bash
python scripts/verify.py --system sony-playstation
```
For a single emulator core:
```bash
python scripts/verify.py --emulator beetle_psx
```
See [Tools](tools.md) for the full CLI reference.
## Next steps
- [FAQ](faq.md) - common questions and troubleshooting
- [Tools](tools.md) - all available scripts and options
- [Architecture](architecture.md) - how the project works internally

Technical documentation for the RetroBIOS toolchain.
## For users
- **[Getting started](getting-started.md)** - installation, BIOS directory paths per platform, verification
- **[FAQ](faq.md)** - common questions, troubleshooting, hash explanations
If you just want to download BIOS packs, see the [home page](../index.md).
## Technical reference
- **[Architecture](architecture.md)** - directory structure, data flow, platform inheritance, pack grouping, security, edge cases, CI workflows
- **[Tools](tools.md)** - CLI reference for every script, pipeline usage, scrapers
- **[Advanced usage](advanced-usage.md)** - custom packs, target filtering, truth generation, emulator verification, offline workflow
- **[Verification modes](verification-modes.md)** - how each platform verifies BIOS files, severity matrix, resolution chain
- **[Data model](data-model.md)** - database.json structure, indexes, file resolution order, YAML formats
- **[Troubleshooting](troubleshooting.md)** - diagnosis by symptom: missing BIOS, hash mismatch, pack issues, verify errors
## For contributors
- **[Profiling guide](profiling.md)** - create an emulator profile from source code, YAML field reference
- **[Adding a platform](adding-a-platform.md)** - scraper, registry, YAML config, exporter, target scraper, install detection
- **[Adding a scraper](adding-a-scraper.md)** - plugin architecture, BaseScraper, parsers, target scrapers
- **[Testing guide](testing-guide.md)** - run tests, fixture pattern, how to add tests, CI integration
- **[Release process](release-process.md)** - CI workflows, large files, manual release
See [contributing](../contributing.md) for submission guidelines.
## Glossary
- **BIOS** - firmware burned into console hardware, needed by emulators that rely on original boot code
- **firmware** - system software loaded by a console at boot; used interchangeably with BIOS in this project
- **HLE** - High-Level Emulation; software reimplementation of BIOS functions, avoids needing the original file
- **hash** - fixed-length fingerprint of a file's contents; this project uses MD5, SHA1, SHA256, and CRC32
- **platform** - a distribution that packages emulators (RetroArch, Batocera, Recalbox, EmuDeck, etc.)
- **core** - an emulator packaged as a libretro plugin, loaded by RetroArch or compatible frontends
- **profile** - a YAML file in `emulators/` documenting one core's BIOS requirements, verified against source code
- **system** - a game console or computer being emulated (e.g. sony-playstation, nintendo-gameboy-advance)
- **pack** - a ZIP archive containing all BIOS files needed by a specific platform
- **ground truth** - the emulator's source code, treated as the authoritative reference for BIOS requirements
- **cross-reference** - comparison of emulator profiles against platform configs to find undeclared files
- **scraper** - a script that fetches BIOS requirement data from an upstream source (System.dat, es_bios.xml, etc.)
- **exporter** - a script that converts ground truth data back into a platform's native format
- **target** - a hardware architecture that a platform runs on (e.g. switch, rpi4, x86_64, steamos)
- **variant** - an alternative version of a BIOS file (different revision, region, or dump), stored in `.variants/`
- **required** - a file the core needs to function; determined by source code behavior
- **optional** - a file the core functions without, possibly with reduced accuracy or missing features
- **hle_fallback** - flag on a file indicating the core has an HLE path; absence is downgraded to INFO severity
- **severity** - the urgency of a verification result: OK (verified), INFO (negligible), WARNING (degraded), CRITICAL (broken)

The source code is the reference because it reflects actual behavior.
Documentation, .info files, and wikis are useful starting points
but are verified against the code.
### Source hierarchy
Documentation and metadata are valuable starting points, but they can
fall out of sync with the actual code over time. The desmume2015 .info
file is a good illustration: it declares `firmware_count=3`, but the
source code at the pinned version opens zero firmware files. Cross-checking
against the source helps catch that kind of gap early.
When sources conflict, priority follows the chain of actual execution:
1. **Original emulator source** (ground truth, what the code actually does)
2. **Libretro port** (may adapt paths, add compatibility shims, or drop features)
3. **.info metadata** (declarative, may be outdated or copied from another core)
For standalone emulators like BizHawk or amiberry, there is only one
level. The emulator's own codebase is the single source of truth. No
.info, no wrapper, no divergence to track.
A note on libretro port differences: the most common change is path
resolution. The upstream emulator loads files from the current working
directory; the libretro wrapper redirects to `retro_system_directory`.
This is normal adaptation, not a divergence worth documenting. Similarly,
filename changes like `naomi2_eeprom.bin` becoming `n2_eeprom.bin` are
often deliberate. RetroArch uses a single shared system directory for
all cores, so the port renames files to prevent collisions between cores
that emulate different systems but happen to use the same generic
filenames. The upstream name goes in `aliases:`.
## Steps
### 1. Find the source code
Always clone both upstream and libretro port to compare.
For libretro cores, cloning both repositories and diffing them reveals
what the port changed. Path changes (fopen of a relative path becoming
a system_dir lookup) are expected. What matters are file additions the
port introduces, files the port dropped, or hash values that differ
between the two codebases.
If the source is hosted outside GitHub, keep looking: emulator source turns up on GitLab, Codeberg, SourceForge, Bitbucket, in archive.org snapshots, and in community mirror tarballs. Inspecting copyright headers or license strings in the libretro fork often points to the original author's site. The upstream code exists somewhere; it's worth continuing the search before concluding the source is unavailable.
One thing worth noting: even when the same repository was analyzed for
a related profile (e.g., fbneo for arcade systems), it helps to do a
fresh pass for each new profile. When fbneo_neogeo was profiled, the
NeoGeo subset referenced BIOS files that the main arcade analysis
hadn't encountered. A fresh look avoids carrying over blind spots.
### 2. Trace file loading
Read the code flow, tracing from the entry point.
Each emulator has its own way of loading files.
Look for:
- Hash validation (MD5, CRC32, SHA1 comparisons in code)
- Size validation (`fseek`/`ftell`, `stat`, fixed buffer sizes)
Grepping for "bios" or "firmware" across the source tree can be a
useful first pass, but it may miss emulators that use different terms
(bootrom, system ROM, IPL, program.rom) and can surface false matches
from test fixtures or comments.
A more reliable approach is starting from the entry point
(`retro_load_game` for libretro, `main()` for standalone) and tracing
the actual file-open calls forward. Each emulator has its own loading
flow. Dolphin loads region-specific IPL files through a boot sequence
object. BlastEm reads a list of ROM paths from a configuration
structure. same_cdi opens CD-i BIOS files through a machine
initialization routine. The loading flow varies widely between emulators.
### 3. Determine required vs optional
This is decided by code behavior, not by judgment:
This is decided by code behavior, not by judgment:
- **required**: the core refuses to boot or crashes without it
- **optional**: the core works with degraded functionality without it
- **hle_fallback: true**: the core has a high-level emulation path when the file is missing
The decision is based on the code's behavior. If the core crashes or
refuses to boot without the file, it is required. If it continues with
degraded functionality (missing boot animation, different fonts, reduced
audio in menus), it is optional. This keeps the classification objective
and consistent across all profiles.
When a core has HLE (high-level emulation), the real BIOS typically
gives better accuracy, but the core functions without it. These files
are marked with `hle_fallback: true` and `required: false`. The file
still ships in packs (better experience for the user), but its absence
does not raise alarms during verification.
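The resulting classification can be expressed as a small decision function. This is an illustrative sketch of the rule, not `verify.py`'s actual implementation:

```python
def missing_file_severity(required: bool, hle_fallback: bool = False) -> str:
    """Severity when an expected BIOS file is absent."""
    if hle_fallback:
        return "INFO"      # core has an HLE path; absence is negligible
    if required:
        return "CRITICAL"  # core crashes or refuses to boot
    return "WARNING"       # core runs with degraded functionality

assert missing_file_severity(required=True) == "CRITICAL"
assert missing_file_severity(required=False) == "WARNING"
assert missing_file_severity(required=False, hle_fallback=True) == "INFO"
```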
### 4. Document divergences
When the libretro port differs from the upstream:
Path differences (current dir vs system_dir) are normal adaptation,
not a divergence. Name changes (e.g. `naomi2_` to `n2_`) may be intentional
to avoid conflicts in the shared system directory.
RetroArch's system directory is shared by every installed core. When
the libretro port renames a file, it is usually solving a real problem:
two cores that both expect `bios.rom` would overwrite each other. The
upstream name goes in `aliases:` and `mode: libretro` on the port-specific
name, so both names are indexed.
True divergences worth documenting are: files the port adds that the
upstream never loads, files the upstream loads that the port dropped
(a gap in the port), and hash differences in embedded ROM data between
the two codebases. These get noted in the profile because they affect
what the user actually needs to provide.
### 5. Write the YAML profile
```yaml
files:
  - source_ref: Source/Core/Core/Boot/Boot_BS2Emu.cpp:42
```
### Writing style
Notes in a profile describe what the core does, focused on what files get loaded, how, and from where. Comparisons with other cores, disclaimers, and feature coverage beyond file requirements belong in external documentation. The profile is a technical spec.
Profiles are standalone documentation. Someone should be able to take
a single YAML file and integrate it into their own project without
knowing anything about this repository's database, directory layout,
or naming conventions. The YAML documents what the emulator expects.
The tooling resolves the YAML against the local file collection
separately.
A few field conventions that protect the toolchain:
- `type:` is operational. `resolve_platform_cores()` uses it to filter
which profiles apply to a platform. Valid values are `libretro`,
`standalone + libretro`, `standalone`, `alias`, `launcher`, `game`,
`utility`, `test`. Putting a classification concept here (like
"bizhawk-native") breaks the filtering. A BizHawk core is
`type: standalone`.
- `core_classification:` is descriptive. It documents the relationship
between the core and the original emulator (pure_libretro,
official_port, community_fork, frozen_snapshot, etc.). It has no
effect on tooling behavior.
- Alternative filenames go in `aliases:` on the file entry (rather than
as separate entries in platform YAMLs or `_shared.yml`). When the same
physical ROM is known by three names across different platforms, one
name is `name:` and the rest are `aliases:`.
- Hashes come from source code. If the source has a hardcoded hex
string (like emuscv's `635a978...` in memory.cpp), that goes in. If
the source embeds ROM data as byte arrays (like ep128emu's roms.hpp),
the bytes can be extracted and hashed. If the source performs no hash
check at all, the hash is omitted from the profile. The .info or docs
may list an MD5, but source confirmation makes it more reliable.
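Extracting a hash from embedded ROM data is mechanical: reassemble the byte array from the source into bytes and hash it. A generic sketch; the C array literal here is invented for illustration:

```python
import hashlib
import re

# Hypothetical C source embedding ROM data as a byte array
c_source = "static const uint8_t rom[] = { 0x31, 0x32, 0x33 };"

# Pull out the hex byte literals and hash the reassembled data
data = bytes(int(tok, 16) for tok in re.findall(r"0x([0-9A-Fa-f]{2})", c_source))
print(hashlib.sha1(data).hexdigest())
```

The resulting hash goes into the profile only because it was derived from the source itself, which is what makes it more reliable than a value copied from a .info file.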
### 6. Validate
```bash
python scripts/cross_reference.py --emulator dolphin --json
python scripts/verify.py --emulator dolphin
```
### Lessons learned
These are patterns that have come up while building profiles. Sharing
them here in case they save time.
**.info metadata can lag behind the code.** The desmume2015 .info
declares `firmware_count=3`, but the core source at the pinned version
never opens any firmware file. The .info is useful as a starting point
but benefits from a cross-check against the actual code.
**Fresh analysis per profile helps.** When fbneo was profiled for
arcade systems, NeoGeo-specific BIOS files were outside the analysis
scope. Profiling fbneo_neogeo later surfaced files the first pass
hadn't covered. Doing a fresh pass for each profile, even on a
familiar codebase, avoids carrying over blind spots.
**Path adaptation vs real divergence.** The libretro wrapper changing
`fopen("./rom.bin")` to load from `system_dir` is the standard
porting pattern. The file is the same; only the directory resolution
changed. True divergences (added/removed files, different embedded
data) are the ones worth documenting.
**Each core has its own loading logic.** snes9x and bsnes both
emulate the Super Nintendo, but they handle the Super Game Boy BIOS
and DSP firmware through different code paths. Checking the actual
code for each core avoids assumptions based on a related profile.
**Code over docs.** Wiki pages and README files sometimes reference
files from older versions or a different fork. If the source code
does not load a particular file, it can be left out of the profile
even if documentation mentions it.
## YAML field reference
### Profile fields
| Field | Required | Description |
|-------|----------|-------------|
| `emulator` | yes | display name |
| `type` | yes | `libretro`, `standalone`, `standalone + libretro`, `alias`, `launcher`, `game`, `utility`, `test` |
| `core_classification` | no | `pure_libretro`, `official_port`, `community_fork`, `frozen_snapshot`, `enhanced_fork`, `game_engine`, `embedded_hle`, `alias`, `launcher` |
| `source` | yes | libretro core repository URL |
| `upstream` | no | original emulator repository URL |
| `profiled_date` | yes | date of source analysis |
| `core_version` | yes | version analyzed |
| `display_name` | no | full display name (e.g. "Sega - Mega Drive (BlastEm)") |
| `systems` | yes | list of system IDs this core handles |
| `cores` | no | list of upstream core names for buildbot/target matching |
| `mode` | no | default mode: `standalone`, `libretro`, or `both` |
| `verification` | no | how the core verifies BIOS: `existence` or `md5` |
| `files` | yes | list of file entries |
| `notes` | no | free-form technical notes |
| `data_directories` | no | references to data dirs in `_data_dirs.yml` |
| `exclusion_note` | no | why the profile has no files despite .info declaring firmware |
| `analysis` | no | structured per-subsystem analysis (capabilities, supported modes) |
| `platform_details` | no | per-system platform-specific details (paths, romsets, forced systems) |
### File entry fields
|-------|-------------|
| `name` | filename as the core expects it |
| `required` | true if the core needs this file to function |
| `system` | system ID this file belongs to (for multi-system profiles) |
| `size` | expected size in bytes |
| `min_size`, `max_size` | size range when the code accepts a range |
| `md5`, `sha1`, `crc32`, `sha256` | expected hashes from source code |
| `validation` | checks the code performs: `size`, `crc32`, `md5`, `sha1`, `adler32`, `signature`, `crypto`. Can be a list or dict `{core: [...], upstream: [...]}` for divergent checks |
| `aliases` | alternate filenames for the same file |
| `mode` | `libretro`, `standalone`, or `both` |
| `hle_fallback` | true if a high-level emulation path exists |
| `category` | `bios` (default), `game_data`, `bios_zip` |
| `region` | geographic region (e.g. `north-america`, `japan`) |
| `source_ref` | source file and line number (e.g. `boot.cpp:42`) |
| `path` | destination path relative to system directory |
| `description` | what this file is |
| `note` | additional context |
| `archive` | parent ZIP if this file is inside an archive |
| `contents` | structure of files inside a BIOS ZIP (`name`, `description`, `size`, `crc32`) |
| `storage` | `large_file` for files > 50 MB stored as release assets |

# Release Process
This page documents the CI/CD pipeline: what each workflow does, how releases
are built, and how to run the process manually.
## CI workflows overview
The project uses 4 GitHub Actions workflows. All use only official GitHub
actions (`actions/checkout`, `actions/setup-python`, `actions/upload-pages-artifact`,
`actions/deploy-pages`). No third-party actions.
Budget target: ~175 minutes/month on the GitHub free tier.
| Workflow | File | Trigger |
|----------|------|---------|
| Build & Release | `build.yml` | Push to `bios/**` or `platforms/**`, manual dispatch |
| Deploy Site | `deploy-site.yml` | Push to main (platforms, emulators, wiki, scripts, database.json, mkdocs.yml), manual |
| PR Validation | `validate.yml` | PR touching `bios/**` or `platforms/**` |
| Weekly Sync | `watch.yml` | Cron Monday 06:00 UTC, manual dispatch |
## build.yml - Build & Release
Currently disabled (`if: false` on the release job) until pack generation is
validated in production.
**Trigger.** Push to `main` on `bios/**` or `platforms/**` paths, or manual
`workflow_dispatch` with optional `force_release` flag to bypass rate limiting.
**Concurrency.** Group `build`, cancel in-progress.
**Steps:**
1. Checkout, Python 3.12, install `pyyaml`
2. Run `test_e2e`
3. Rate limit check: skip if last release was less than 7 days ago (unless
`force_release` is set)
4. Restore large files from the `large-files` release into `.cache/large/`
5. Refresh data directories (`refresh_data_dirs.py`)
6. Build packs (`generate_pack.py --all --output-dir dist/`)
7. Create GitHub release with tag `v{YYYY.MM.DD}` (appends `.N` suffix if
a same-day release already exists)
8. Clean up old releases, keeping the 3 most recent plus `large-files`
**Release notes** include file count, total size, per-pack sizes, and the last
15 non-merge commits touching `bios/` or `platforms/`.
## deploy-site.yml - Deploy Documentation Site
**Trigger.** Push to `main` when any of these paths change: `platforms/`,
`emulators/`, `wiki/`, `scripts/generate_site.py`, `scripts/generate_readme.py`,
`scripts/verify.py`, `scripts/common.py`, `database.json`, `mkdocs.yml`.
Also manual dispatch.
**Steps:**
1. Checkout, Python 3.12
2. Install `pyyaml`, `mkdocs-material`, `pymdown-extensions`
3. Run `generate_site.py` (converts YAML data into MkDocs pages)
4. Run `generate_readme.py` (rebuilds README.md and CONTRIBUTING.md)
5. `mkdocs build` to produce the static site
6. Upload artifact, deploy to GitHub Pages
The site is deployed via the `github-pages` environment using the official
`actions/deploy-pages` action.
## validate.yml - PR Validation
**Trigger.** Pull requests that modify `bios/**` or `platforms/**`.
**Concurrency.** Per-PR group, cancel in-progress.
Four parallel jobs:
**validate-bios.** Diffs the PR to find changed BIOS files, runs
`validate_pr.py --markdown` on each, and posts the validation report as a PR
comment (hash verification, database match status).
**validate-configs.** Validates all platform YAML files against
`schemas/platform.schema.json` using `jsonschema`. Fails if any config does
not match the schema.
**run-tests.** Runs `python -m unittest tests.test_e2e -v`. Must pass before
merge.
**label-pr.** Auto-labels the PR based on changed paths:
| Path pattern | Label |
|-------------|-------|
| `bios/` | `bios` |
| `bios/{Manufacturer}/` | `system:{manufacturer}` |
| `platforms/` | `platform-config` |
| `scripts/` | `automation` |
## watch.yml - Weekly Platform Sync
**Trigger.** Cron schedule every Monday at 06:00 UTC, or manual dispatch.
**Flow:**
1. Scrape live upstream sources (System.dat, batocera-systems, es_bios.xml,
etc.) and regenerate platform YAML configs
2. Auto-fetch missing BIOS files
3. Refresh data directories
4. Run dedup
5. Regenerate `database.json`
6. Create or update a PR with labels `automated` and `platform-update`
The PR contains all changes from the scrape cycle. A maintainer reviews and
merges.
## Large files management
Files larger than 50 MB are stored as assets on a permanent GitHub release
named `large-files` (to keep the git repository lightweight).
Known large files: PS3UPDAT.PUP, PSVUPDAT.PUP, PSP2UPDAT.PUP, dsi_nand.bin,
maclc3.zip, Firmware.19.0.0.zip (Switch).
**Storage.** Listed in `.gitignore` so they stay out of git history. The
`large-files` release is excluded from cleanup (the build workflow only
deletes version-tagged releases).
**Build-time restore.** The build workflow downloads all assets from
`large-files` into `.cache/large/` and copies them to their expected paths
before pack generation.
**Upload.** To add or update a large file:
```bash
gh release upload large-files "bios/Sony/PS3/PS3UPDAT.PUP#PS3UPDAT.PUP"
```
**Local cache.** `generate_pack.py` calls `fetch_large_file()` which downloads
from the release and caches in `.cache/large/` for subsequent runs.
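The cache-then-download pattern looks roughly like this. A hypothetical helper mirroring the behavior described above, not the actual `fetch_large_file` code:

```python
import urllib.request
from pathlib import Path

CACHE_DIR = Path(".cache/large")

def fetch_cached(name: str, url: str) -> Path:
    """Return a local path for a large release asset, downloading at most once."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / name
    if not cached.exists():                    # cache miss: download
        urllib.request.urlretrieve(url, cached)
    return cached                              # cache hit: reuse local copy
```

Subsequent runs hit the cache, so repeated pack builds don't re-download multi-hundred-megabyte firmware files.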
## Manual release process
When `build.yml` is disabled, build and release manually:
```bash
# Run the full pipeline (DB + verify + packs + consistency check)
python scripts/pipeline.py --offline
# Or step by step:
python scripts/generate_db.py --force --bios-dir bios --output database.json
python scripts/verify.py --all
python scripts/generate_pack.py --all --output-dir dist/
# Create the release
DATE=$(date +%Y.%m.%d)
gh release create "v${DATE}" dist/*.zip \
--title "BIOS Pack v${DATE}" \
--notes "Release notes here" \
--latest
```
To re-enable automated releases, remove the `if: false` guard from the
`release` job in `build.yml`.

# Testing Guide
This page covers how to run, understand, and extend the test suite.
All tests use synthetic fixtures. No real BIOS files, platform configs, or
network access required.
## Running tests
Run a single test module:
```bash
python -m unittest tests.test_e2e -v
python -m unittest tests.test_mame_parser -v
python -m unittest tests.test_fbneo_parser -v
python -m unittest tests.test_hash_merge -v
```
Run the full suite:
```bash
python -m unittest discover tests -v
```
The only dependency is `pyyaml`. No test framework beyond the standard
library `unittest` module.
## Test architecture
### test_e2e.py
The main regression suite. A single `TestE2E` class exercises every code path
through the resolution, verification, pack generation, and cross-reference
logic.
**Fixture pattern.** `setUp` creates a temporary directory tree with:
- Fake BIOS files (deterministic content for hash computation)
- Platform YAML configs (existence mode, MD5 mode, inheritance, shared groups)
- Emulator profile YAMLs (required/optional files, aliases, HLE, standalone)
- A synthetic `database.json` keyed by SHA1
`tearDown` removes the temporary tree.
**Test numbering.** Tests are grouped by category:
| Range | Category |
|-------|----------|
| `test_01`--`test_14` | File resolution (SHA1, MD5, name, alias, truncated MD5, composite, zip contents, variants, hash mismatch) |
| `test_20`--`test_31` | Verification (existence mode, MD5 mode, required/optional severity, zipped file, multi-hash) |
| `test_40`--`test_47` | Cross-reference (undeclared files, standalone skip, alias profiles, data dir suppression, exclusion notes) |
| `test_50`+ | Platform config (inheritance, shared groups, data directories, grouping, core resolution, target filtering, ground truth) |
Each test calls the same functions that `verify.py` and `generate_pack.py` use
in production, against the synthetic fixtures.
### Parser tests
**test_mame_parser.** Tests the MAME C source parser that extracts BIOS root
sets from driver files. Fixtures are inline C source snippets containing
`ROM_START`, `ROM_LOAD`, `GAME()`/`COMP()` macros with
`MACHINE_IS_BIOS_ROOT`. Tests cover:
- Standard `GAME` macro detection
- `COMP` macro detection
- `ROM_LOAD` / `ROMX_LOAD` parsing (name, size, CRC32, SHA1)
- `ROM_SYSTEM_BIOS` variant extraction
- Multi-region ROM blocks
- Macro expansion and edge cases
**test_fbneo_parser.** Tests the FBNeo C source parser that identifies
`BDF_BOARDROM` sets. Same inline fixture approach.
**test_hash_merge.** Tests the text-based YAML patching module used to merge
upstream BIOS hashes into emulator profiles. Covers:
- Merge operations (add new hashes, update existing)
- Diff computation (detect what changed)
- Formatting preservation (comments, ordering, flow style)
Fixtures are programmatically generated YAML/JSON files written to a temp
directory.
## How to add a test
1. **Pick the right category.** Find the number range that matches the
subsystem you are testing. If none fits, start a new range after the last
existing one.
2. **Create synthetic fixtures.** Write the minimum YAML configs and fake
files needed to isolate the behavior. Use `tempfile.mkdtemp` for a clean
workspace. Avoid depending on the repo's real `bios/` or `platforms/`
directories.
3. **Call production functions.** Import from `common`, `verify`, `validation`,
or `truth` and call the same entry points that the CLI scripts use. Do not
re-implement logic in tests.
4. **Assert specific outcomes.** Check `Status`, `Severity`, resolution
method, file counts, or pack contents. Avoid brittle assertions on log
output or formatting.
5. **Run the full suite.** After adding your test, run `python -m unittest
discover tests -v` to verify nothing else broke.
Example skeleton:
```python
def test_42_my_new_behavior(self):
# Write minimal fixtures to self.root
profile = {"emulator": "test_core", "files": [...]}
with open(os.path.join(self.emulators_dir, "test_core.yml"), "w") as f:
yaml.dump(profile, f)
# Call production code
result = verify_platform(self.config, self.db, ...)
# Assert specific outcomes
self.assertEqual(result[0]["status"], Status.OK)
```
## Verification discipline
The test suite is one layer of verification. The full quality gate is:
1. All unit tests pass (`python -m unittest discover tests`)
2. The full pipeline completes without error (`python scripts/pipeline.py --offline`)
3. No unexpected CRITICAL entries in the verify output
4. Pack file counts match verification file counts (consistency check)
If a change passes tests but breaks the pipeline, it's worth investigating before merging. Similarly, new CRITICAL entries in the verify output after a change usually indicate something to look into. The pipeline is designed so that all steps agree: if verify reports N files for a platform, the pack should contain exactly N files.
Ideally, tests, code, and documentation ship together. When profiles and platform configs are involved, updating them in the same change helps keep everything in sync.
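The pack-vs-verify consistency gate reduces to a set comparison. An illustrative sketch, assuming a list of verify result dicts and a pack ZIP; names are invented:

```python
import zipfile
from pathlib import Path

def check_consistency(verify_results: list[dict], pack_path: Path) -> None:
    """Fail if the pack does not contain exactly the files verify expected."""
    expected = {r["name"] for r in verify_results if r["status"] != "MISSING"}
    with zipfile.ZipFile(pack_path) as z:
        packed = {i.filename for i in z.infolist() if not i.is_dir()}
    if packed != expected:
        # symmetric difference shows both extra and missing entries
        raise ValueError(f"pack/verify drift: {sorted(packed ^ expected)}")
```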
## CI integration
The `validate.yml` workflow runs `test_e2e` on every pull request that touches
`bios/` or `platforms/` files. The test job (`run-tests`) runs in parallel
with BIOS validation, schema validation, and auto-labeling.
Tests must pass before merge. If a test fails in CI, reproduce locally with:
```bash
python -m unittest tests.test_e2e -v 2>&1 | head -50
```
The `build.yml` workflow also runs the test suite before building release
packs.

All tools are Python scripts in `scripts/`. Single dependency: `pyyaml`.
Run everything in sequence:
```bash
python scripts/pipeline.py --offline # DB + verify + packs + manifests + readme + site
python scripts/pipeline.py --offline --skip-packs # DB + verify only
python scripts/pipeline.py --offline --skip-docs # skip readme + site generation
python scripts/pipeline.py --offline --target switch # filter by hardware target
python scripts/pipeline.py --offline --with-truth # include truth generation + diff
python scripts/pipeline.py --offline --with-export # include native format export
python scripts/pipeline.py --check-buildbot # check buildbot data freshness
```
Pipeline steps:
| Step | Description | Skipped by |
|------|-------------|------------|
| 1/9 | Generate database | - |
| 2/9 | Refresh data directories | `--offline` |
| 2a | Refresh MAME BIOS hashes | `--offline` |
| 2a2 | Refresh FBNeo BIOS hashes | `--offline` |
| 2b | Check buildbot staleness | only with `--check-buildbot` |
| 2c | Generate truth YAMLs | only with `--with-truth` / `--with-export` |
| 2d | Diff truth vs scraped | only with `--with-truth` / `--with-export` |
| 2e | Export native formats | only with `--with-export` |
| 3/9 | Verify all platforms | - |
| 4/9 | Generate packs | `--skip-packs` |
| 4b | Generate install manifests | `--skip-packs` |
| 4c | Generate target manifests | `--skip-packs` |
| 5/9 | Consistency check | if verify or pack skipped |
| 8/9 | Generate README | `--skip-docs` |
| 9/9 | Generate site | `--skip-docs` |
## Individual tools
### generate_db.py
```bash
python scripts/generate_db.py --force --bios-dir bios --output database.json
```
Check BIOS coverage for each platform using its native verification mode.
```bash
python scripts/verify.py --all # all platforms
python scripts/verify.py --platform batocera # single platform
python scripts/verify.py --platform retroarch --verbose # with ground truth details
python scripts/verify.py --emulator dolphin # single emulator
python scripts/verify.py --emulator dolphin --standalone # standalone mode only
python scripts/verify.py --system atari-lynx # single system
python scripts/verify.py --platform retroarch --target switch # filter by hardware
python scripts/verify.py --list-emulators # list all emulators
python scripts/verify.py --list-systems # list all systems
python scripts/verify.py --platform retroarch --list-targets # list available targets
```
Verification modes per platform:
| Platform | Mode | Method |
|----------|------|--------|
| EmuDeck | md5 | MD5 whitelist per system |
| RetroDECK | md5 | MD5 per file via component manifests |
| RomM | md5 | size + any hash (MD5/SHA1/CRC32) |
| BizHawk | sha1 | SHA1 per firmware from FirmwareDatabase.cs |
### generate_pack.py
```bash
python scripts/generate_pack.py --platform retroarch --split --group-by manufacturer
python scripts/generate_pack.py --from-md5 d8f1206299c48946e6ec5ef96d014eaa
python scripts/generate_pack.py --platform batocera --from-md5-file missing.txt
python scripts/generate_pack.py --platform retroarch --list-systems
# Hardware target filtering
python scripts/generate_pack.py --all --target x86_64
python scripts/generate_pack.py --platform retroarch --target switch
# Install manifests (consumed by install.py)
python scripts/generate_pack.py --all --manifest --output-dir install/
python scripts/generate_pack.py --manifest-targets --output-dir install/targets/
```
Packs include platform baseline files plus files required by the platform's cores.
When a file's hash does not match, a better variant is searched for.
If none exists, the platform version is kept and the discrepancy is reported.
- `--split --group-by manufacturer`: group split packs by manufacturer (Sony, Nintendo, Sega...)
- `--from-md5`: look up a hash in the database, or build a custom pack with `--platform`/`--emulator`
- `--from-md5-file`: same, reading hashes from a file (one per line, comments with #)
- `--target`: filter by hardware target (e.g. `switch`, `rpi4`, `x86_64`)
### cross_reference.py
Compare emulator profiles against platform configs.
Reports files that cores need beyond what platforms declare.
```bash
python scripts/cross_reference.py # all
python scripts/cross_reference.py --emulator dolphin # single
python scripts/cross_reference.py --emulator dolphin --json # JSON output
```
### truth.py, generate_truth.py, diff_truth.py
Generate ground truth from emulator profiles, diff against scraped platform data.
```bash
python scripts/generate_truth.py --platform retroarch # single platform truth
python scripts/generate_truth.py --all --output-dir dist/truth/ # all platforms
python scripts/diff_truth.py --platform retroarch # diff truth vs scraped
python scripts/diff_truth.py --all # diff all platforms
```
### export_native.py
Export truth data to native platform formats (System.dat, es_bios.xml, checkBIOS.sh, etc.).
```bash
python scripts/export_native.py --platform batocera
python scripts/export_native.py --all --output-dir dist/upstream/
```
### validation.py
Validation index and ground truth formatting. Used by verify.py for emulator-level checks
(size, CRC32, MD5, SHA1, crypto). Separates reproducible hash checks from cryptographic
validations that require console-specific keys.
### refresh_data_dirs.py
Fetch data directories (Dolphin Sys, PPSSPP assets, blueMSX databases)
```bash
python scripts/refresh_data_dirs.py --key dolphin-sys --force
```
| Script | Purpose |
|--------|---------|
| `common.py` | Shared library: hash computation, file resolution, platform config loading, emulator profiles, target filtering |
| `dedup.py` | Deduplicate `bios/`, move duplicates to `.variants/`. RPG Maker and ScummVM excluded (NODEDUP) |
| `validate_pr.py` | Validate BIOS files in pull requests, post markdown report |
| `auto_fetch.py` | Fetch missing BIOS files from known sources (4-step pipeline) |
| `list_platforms.py` | List active platforms (used by CI) |
| `download.py` | Download packs from GitHub releases (Python, multi-threaded) |
| `generate_readme.py` | Generate README.md and CONTRIBUTING.md from database |
| `generate_site.py` | Generate all MkDocs site pages (this documentation) |
| `deterministic_zip.py` | Rebuild MAME BIOS ZIPs deterministically (same ROMs = same hash) |
| `crypto_verify.py` | 3DS RSA signature and AES crypto verification |
| `sect233r1.py` | Pure Python ECDSA verification on sect233r1 curve (3DS OTP cert) |
| `batch_profile.py` | Batch profiling automation for libretro cores |
| `check_buildbot_system.py` | Detect stale data directories by comparing with buildbot |
| `migrate.py` | Migrate flat bios structure to Manufacturer/Console/ hierarchy |
## Installation tools
Cross-platform BIOS installer for end users:
```bash
# Python installer (auto-detects platform)
python install.py
# Shell one-liner (Linux/macOS)
bash scripts/download.sh retroarch ~/RetroArch/system/
bash scripts/download.sh --list
# Or via install.sh wrapper (detects curl/wget, runs install.py)
bash install.sh
```
`install.py` auto-detects the user's platform by checking config files,
downloads the matching BIOS pack from GitHub releases with SHA1 verification,
and extracts files to the correct directory. `install.ps1` provides
equivalent functionality for Windows/PowerShell.
## Large files
Files over 50 MB are stored as assets on the `large-files` GitHub release.
They are listed in `.gitignore` to keep the git repository lightweight.
`generate_db.py` downloads them from the release when rebuilding the database,
using `fetch_large_file()` from `common.py`. The same function is used by
`generate_pack.py` when a file has a hash mismatch with the local variant.
## Scrapers

Located in `scripts/scraper/`. Each inherits `BaseScraper` and implements `fetch_requirements()`.

| Scraper | Source | Format |
|---------|--------|--------|
| `retrobat_scraper` | batocera-systems.json | JSON |
| `emudeck_scraper` | checkBIOS.sh | Bash + CSV |
| `retrodeck_scraper` | component manifests | JSON per component |
| `romm_scraper` | known_bios_files.json | JSON |
| `coreinfo_scraper` | .info files from libretro-core-info | INI-like |
| `bizhawk_scraper` | FirmwareDatabase.cs | C# source |
| `mame_hash_scraper` | mamedev/mame source tree | C source (sparse clone) |
| `fbneo_hash_scraper` | FBNeo source tree | C source (sparse clone) |
Internal modules: `base_scraper.py` (abstract base with `_fetch_raw()` caching
and shared CLI), `dat_parser.py` (clrmamepro DAT format parser),
`mame_parser.py` (MAME C source BIOS root set parser),
`fbneo_parser.py` (FBNeo C source BIOS set parser),
`_hash_merge.py` (text-based YAML patching that preserves formatting).
Adding a scraper: inherit `BaseScraper`, implement `fetch_requirements()`,
call `scraper_cli(YourScraper)` in `__main__`.
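A minimal scraper following that recipe might look like the sketch below. The `BaseScraper` here is a stand-in so the example is self-contained; the real base class in `scripts/scraper/base_scraper.py` also provides `_fetch_raw()` caching and the shared CLI. The hash value is a dummy placeholder, and the return shape is illustrative.

```python
class BaseScraper:  # stand-in for scripts/scraper/base_scraper.py
    def fetch_requirements(self):
        raise NotImplementedError


class ExampleScraper(BaseScraper):
    """Parse a hypothetical 'filename md5' upstream list into entries."""

    def fetch_requirements(self):
        # In a real scraper this raw text would come from _fetch_raw().
        raw = "scph5501.bin 00000000000000000000000000000000"  # dummy hash
        out = {}
        for line in raw.splitlines():
            name, md5 = line.split()
            out[name] = {"md5": md5}
        return out
```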
## Target scrapers
Located in `scripts/scraper/targets/`. Each inherits `BaseTargetScraper` and implements `fetch_targets()`.
| Scraper | Source | Targets |
|---------|--------|---------|
| `retroarch_targets_scraper` | libretro buildbot nightly | 20+ architectures |
| `batocera_targets_scraper` | Config.in + es_systems.yml | 35+ boards |
| `emudeck_targets_scraper` | EmuScripts GitHub API | steamos, windows |
| `retropie_targets_scraper` | scriptmodules + rp_module_flags | 7 platforms |
```bash
python -m scripts.scraper.targets.retroarch_targets_scraper --dry-run
python -m scripts.scraper.targets.batocera_targets_scraper --dry-run
```
## Exporters
Located in `scripts/exporter/`. Each inherits `BaseExporter` and implements `export()`.
| Exporter | Output format |
|----------|--------------|
| `systemdat_exporter` | clrmamepro DAT (RetroArch System.dat) |
| `batocera_exporter` | Python dict (batocera-systems) |
| `recalbox_exporter` | XML (es_bios.xml) |
| `retrobat_exporter` | JSON (batocera-systems.json) |
| `emudeck_exporter` | Bash script (checkBIOS.sh) |
| `retrodeck_exporter` | JSON (component_manifest.json) |
| `romm_exporter` | JSON (known_bios_files.json) |
| `lakka_exporter` | clrmamepro DAT (delegates to systemdat) |
| `retropie_exporter` | clrmamepro DAT (delegates to systemdat) |

# Troubleshooting - RetroBIOS
Diagnosis guide organized by symptom. Each section describes what to check and how to fix it.
## Game won't start / black screen
Most launch failures are caused by a missing or incorrect BIOS file.
**Check if the BIOS exists:**
```bash
python scripts/verify.py --platform retroarch --verbose
python scripts/verify.py --system sony-playstation
```
Look for `MISSING` entries in the output. A missing required BIOS means the core
cannot start games for that system at all.
**Check if the hash matches:**
Look for `HASH_MISMATCH` in the verify output. This means the file exists but
contains different data than expected. Common causes:
- Wrong region (a PAL BIOS instead of NTSC, or vice versa)
- Wrong hardware revision (e.g. SCPH-5501 vs SCPH-1001 for PlayStation)
- Corrupted download
Each system page on the site lists the expected hashes. Compare your file's
MD5 or SHA1 against those values.
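To hash your local file for that comparison, a few lines of Python are enough. This is a generic sketch (not the repository's `compute_hashes()`), reading in chunks so large files don't load into memory at once:

```python
import hashlib


def file_hashes(path):
    """Compute MD5 and SHA1 of a file, reading it in 1 MB chunks."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}
```

Compare the printed digests against the values on the system page; both must be lowercase hex.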
**Wrong region BIOS:**
Some cores require region-specific BIOS files. A Japanese BIOS won't boot
North American games on cores that enforce region matching. Check the emulator
profile for your core to see which regions are supported and which files
correspond to each.
## BIOS not found by emulator
The file exists on disk, but the emulator reports it as missing.
**Wrong directory:**
Each platform expects BIOS files in a specific base directory:
- RetroArch, Lakka: `system/` inside the RetroArch directory
- Batocera: `/userdata/bios/`
- Recalbox: `/recalbox/share/bios/`
- RetroPie: `~/RetroPie/BIOS/`
Some cores expect files in subdirectories (e.g. `dc/` for Dreamcast, `pcsx2/bios/`
for PlayStation 2). Check the `path:` field in the emulator profile for the exact
expected location relative to the base directory.
**Wrong filename:**
Cores match BIOS files by exact filename. If a core expects `scph5501.bin` and your
file is named `SCPH-5501.BIN`, it won't be found on platforms that do exact name matching.
Check the emulator profile for the expected filename and any aliases listed under
`aliases:`. Aliases are alternative names that the core also accepts.
**Case sensitivity:**
Linux filesystems are case-sensitive. A file named `Bios.ROM` won't match a lookup
for `bios.rom`. Windows and macOS are case-insensitive by default, so the same
file works there but fails on Linux.
Batocera's verification uses `casefold()` for case-insensitive matching, but
the actual emulator may still require exact case. When in doubt, use the exact
filename from the emulator profile.
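The difference between case-insensitive matching and exact matching can be illustrated with a short sketch of a `casefold()`-style lookup (illustrative, not Batocera's actual code):

```python
def find_case_insensitive(expected, names):
    """Return the stored name matching `expected` ignoring case, or None."""
    wanted = expected.casefold()
    for name in names:
        if name.casefold() == wanted:
            return name
    return None
```

Note that this tolerates `SCPH5501.BIN` vs `scph5501.bin`, but a differently spelled name like `SCPH-5501.BIN` still fails: case folding does not fix punctuation differences.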
## Hash mismatch / UNTESTED
`verify.py` reports `HASH_MISMATCH` or `UNTESTED` for a file.
**HASH_MISMATCH:**
The file exists and was hashed, but the computed hash doesn't match any expected
value. This means you have a different version of the file than what the platform
or emulator expects.
To find the correct version, check the system page on the site. It lists every
known BIOS file with its expected MD5 and SHA1.
**UNTESTED:**
On existence-only platforms (RetroArch, Lakka, RetroPie), the file is present
but its hash was not verified against a known value. The platform itself only
checks that the file exists. The `--verbose` flag shows ground truth data from
emulator profiles, which can confirm whether the file's hash is actually correct.
**The .variants/ directory:**
When multiple versions of the same BIOS exist (different revisions, regions, or
dumps), the primary version lives in the main directory and alternatives live in
`.variants/`. `verify.py` checks the primary file first, then falls back to
variants when resolving by hash.
If your file matches a variant hash but not the primary, it's a valid BIOS --
just not the preferred version. Some cores accept multiple versions.
## Pack is missing files
A generated pack doesn't contain all the files you expected.
**Severity levels:**
`verify.py` assigns a severity to each issue. Not all missing files are equally
important:
| Severity | Meaning | Action needed |
|----------|---------|---------------|
| CRITICAL | Required file missing or hash mismatch on MD5 platforms | Must fix. Core won't function. |
| WARNING | Optional file missing, or hash mismatch on existence platforms | Core works but with reduced functionality. |
| INFO | Optional file missing on existence-only platforms, or HLE fallback available | Core works fine, BIOS improves accuracy. |
| OK | File present and verified | No action needed. |
Focus on CRITICAL issues first. WARNING files improve the experience but aren't
strictly necessary. INFO files are nice to have.
**Large files (over 50 MB):**
Files like PS3UPDAT.PUP, PSVUPDAT.PUP, and Switch firmware are too large for the
git repository. They are stored as GitHub release assets under the `large-files`
release and downloaded at build time.
If a pack build fails to include these, check your network connection. In offline
mode (`--offline`), large files are only included if already cached locally in
`.cache/large/`.
**Data directories:**
Some cores need entire directory trees rather than individual files (e.g. Dolphin's
`Sys/` directory, PPSSPP's `assets/`). These are fetched by `refresh_data_dirs.py`
from upstream repositories.
In offline mode, data directories are only included if already cached in `data/`.
Run `python scripts/refresh_data_dirs.py` to fetch them.
## verify.py reports errors
How to read and interpret `verify.py` output.
**Status codes:**
| Status | Meaning |
|--------|---------|
| `ok` | File present, hash matches (or existence check passed) |
| `untested` | File present, hash not confirmed (existence-only platforms) |
| `missing` | File not found in the repository |
| `hash_mismatch` | File found but hash doesn't match expected value |
| `size_mismatch` | File found but size doesn't match what the emulator expects |
**Reading the output:**
Each line shows the file path, its status, and severity. In verbose mode, ground
truth data from emulator profiles is appended, showing which cores reference the
file and what validations they perform.
```
scph5501.bin ok [OK]
dc_boot.bin missing [CRITICAL]
gba_bios.bin untested [WARNING]
```
**Cross-reference section:**
After per-file results, `verify.py` prints a cross-reference report. This lists
files that emulator cores need but that the platform YAML doesn't declare. These
files are still included in packs automatically, but the report helps identify
gaps in platform coverage data.
The cross-reference uses `resolve_platform_cores()` to determine which emulator
profiles are relevant for each platform, then checks whether each profile's files
appear in the platform config.
**Filtering output:**
```bash
# By platform
python scripts/verify.py --platform batocera
# By emulator core
python scripts/verify.py --emulator beetle_psx
# By system
python scripts/verify.py --system sony-playstation
# By hardware target
python scripts/verify.py --platform retroarch --target switch
# JSON for scripted processing
python scripts/verify.py --platform retroarch --json
```
## Installation script fails
Problems with `install.py`, `install.sh`, or `download.sh`.
**Network issues:**
The installer downloads packs from GitHub releases. If the download fails:
- Check your internet connection
- Verify that `https://github.com` is reachable
- If behind a proxy, set `HTTPS_PROXY` in your environment
- Try again later if GitHub is experiencing issues
**Permission denied:**
The installer needs write access to the target directory.
- On Linux/macOS: check directory ownership (`ls -la`) and run with appropriate
permissions. Avoid running as root unless the target directory requires it.
- On Windows: run PowerShell as Administrator if installing to a protected directory.
**Platform not detected:**
`install.py` auto-detects your platform by checking for known config files. If
detection fails, specify the platform manually:
```bash
python install.py --platform retroarch --dest ~/RetroArch/system/
python install.py --platform batocera --dest /userdata/bios/
```
Use `python install.py --help` to see all available platforms and options.
**Pack not found in release:**
If the installer reports that no pack exists for your platform, check available
releases:
```bash
python scripts/download.py --list
# or
bash scripts/download.sh --list
```
Some platforms share packs (Lakka uses the RetroArch pack). The installer handles
this mapping automatically, but if you're downloading manually, check which pack
name corresponds to your platform.

# Verification Modes
Each platform verifies BIOS files differently. `verify.py` replicates the native behavior
of each platform so that verification results match what the platform itself would report.
## Existence Mode
**Platforms**: RetroArch, Lakka, RetroPie
**Source**: RetroArch `core_info.c`, function `path_is_valid()`
The most straightforward mode. A file is OK if it exists at the expected path. No hash is checked.
Any file with the correct name passes, regardless of content.
| Condition | Status | Severity (required) | Severity (optional) |
|-----------|--------|---------------------|---------------------|
| File present | OK | OK | OK |
| File missing | MISSING | WARNING | INFO |
RetroArch does not distinguish between a correct and an incorrect BIOS at the verification
level. A corrupt or wrong-region file still shows as present. This is by design in the
upstream code: `core_info.c` only calls `path_is_valid()` and does not open or hash the file.
Lakka and RetroPie inherit this behavior through platform config inheritance
(`inherits: retroarch` in the platform YAML).
## MD5 Mode
**Platforms**: Batocera, RetroBat, Recalbox, EmuDeck, RetroDECK, RomM
All MD5-mode platforms compute a hash of the file and compare it against an expected value.
The details vary by platform.
### Standard MD5 (Batocera, RetroBat)
`verify.py` replicates Batocera's `md5sum()` function. The file is read in binary mode,
hashed with MD5, and compared case-insensitively against the expected value.
| Condition | Status | Severity (required) | Severity (optional) |
|-----------|--------|---------------------|---------------------|
| Hash matches | OK | OK | OK |
| File present, hash differs | UNTESTED | WARNING | WARNING |
| File missing | MISSING | CRITICAL | WARNING |
If the `resolve_local_file` step already confirmed the MD5 match (status `md5_exact`),
`verify.py` skips re-hashing and returns OK directly.
### Truncated MD5 (Batocera bug)
Some entries in Batocera's system data contain 29-character MD5 strings instead of
the standard 32. This is a known upstream bug. `verify.py` handles it by prefix matching:
if the expected hash is shorter than 32 characters, the actual hash is compared against
only its first N characters.
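The prefix-matching workaround can be sketched as follows (illustrative, not the exact code in `verify.py`):

```python
def md5_matches(expected, actual):
    """Compare MD5s case-insensitively; a truncated expected hash
    (fewer than 32 hex chars, the Batocera bug above) matches as a prefix."""
    expected, actual = expected.lower(), actual.lower()
    if len(expected) < 32:
        return actual.startswith(expected)
    return actual == expected
```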
### md5_composite (Recalbox ZIP verification)
Recalbox computes `Zip::Md5Composite` for ZIP files: the MD5 of the concatenation of all
inner file MD5s (sorted by filename). `verify.py` replicates this with `md5_composite()`
from `common.py`. When a ZIP file's direct MD5 does not match, the composite is tried
before reporting a mismatch.
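A sketch of the composite computation, under the assumption that the inner hex digests are concatenated in filename order with no separator (the exact byte layout is defined by Recalbox's `Zip::Md5Composite` and replicated in `common.py`):

```python
import hashlib
import zipfile


def md5_composite(zip_path):
    """MD5 over the concatenated MD5 hex digests of all inner files,
    sorted by filename (directory entries skipped)."""
    with zipfile.ZipFile(zip_path) as zf:
        inner = sorted(n for n in zf.namelist() if not n.endswith("/"))
        parts = "".join(hashlib.md5(zf.read(name)).hexdigest() for name in inner)
    return hashlib.md5(parts.encode("ascii")).hexdigest()
```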
### Multi-hash (Recalbox)
Recalbox allows comma-separated MD5 values for a single file entry, accepting any one
of them as valid. `verify.py` splits on commas and tries each hash. A match against any
listed hash is OK.
### Mandatory levels (Recalbox)
Recalbox uses three severity levels derived from two YAML fields (`mandatory` and
`hashMatchMandatory`):
| mandatory | hashMatchMandatory | Color | verify.py mapping |
|-----------|--------------------|--------|-------------------|
| true | true | RED | CRITICAL |
| true | false | YELLOW | WARNING |
| false | (any) | GREEN | INFO |
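The mapping in the table reduces to a small decision function (sketch; names are illustrative):

```python
def recalbox_severity(mandatory, hash_match_mandatory):
    """Map Recalbox's two YAML flags to a verify.py severity."""
    if not mandatory:
        return "INFO"      # GREEN: hashMatchMandatory is ignored
    if hash_match_mandatory:
        return "CRITICAL"  # RED
    return "WARNING"       # YELLOW
```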
### checkInsideZip (Batocera zippedFile)
When a platform entry has a `zipped_file` field, the expected MD5 is not the hash of the
ZIP container but of a specific ROM file inside the ZIP. `verify.py` replicates Batocera's
`checkInsideZip()`:
1. Open the ZIP.
2. Find the inner file by name (case-insensitive via `casefold()`).
3. Read its contents and compute MD5.
4. Compare against the expected hash.
If the inner file is not found inside the ZIP, the status is UNTESTED with a reason string.
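The four steps above can be sketched like this (a simplified stand-in for the replicated `checkInsideZip()`; the real code also carries a reason string):

```python
import hashlib
import zipfile


def check_inside_zip(zip_path, inner_name, expected_md5):
    """Hash a specific ROM inside a ZIP, matching the inner name
    case-insensitively. Returns (status, actual_md5_or_None)."""
    wanted = inner_name.casefold()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.casefold() == wanted:
                actual = hashlib.md5(zf.read(name)).hexdigest()
                status = "ok" if actual == expected_md5.lower() else "untested"
                return (status, actual)
    return ("untested", None)  # inner file not found in the ZIP
```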
### RomM verification
RomM checks both file size and hash. It accepts any hash type (MD5, SHA1, or CRC32).
ZIP files are not opened; only the container is checked. `verify.py` replicates this
by checking size first, then trying each available hash.
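The "size first, then any hash" rule can be sketched for an in-memory file (illustrative; the real check works on paths and the database index):

```python
import hashlib
import zlib


def romm_check(data, expected_size, known_hashes):
    """Size must match exactly; then any one of MD5, SHA1, or CRC32
    appearing in `known_hashes` (case-insensitive hex) is enough."""
    if len(data) != expected_size:
        return False
    candidates = {
        hashlib.md5(data).hexdigest(),
        hashlib.sha1(data).hexdigest(),
        format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
    }
    return bool(candidates & {h.lower() for h in known_hashes})
```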
## SHA1 Mode
**Platforms**: BizHawk
BizHawk firmware entries use SHA1 as the primary hash. `verify.py` computes SHA1
via `compute_hashes()` and compares case-insensitively.
| Condition | Status | Severity (required) | Severity (optional) |
|-----------|--------|---------------------|---------------------|
| SHA1 matches | OK | OK | OK |
| File present, SHA1 differs | UNTESTED | WARNING | WARNING |
| File missing | MISSING | CRITICAL | WARNING |
## Emulator-Level Validation
Independent of platform verification mode, `verify.py` runs emulator-level validation
from `validation.py`. This layer uses data from emulator profiles (YAML files in
`emulators/`), which are source-verified against emulator code.
### Validation index
`_build_validation_index()` reads all emulator profiles and builds a per-filename
index of validation rules. When multiple emulators reference the same file, checks
are merged (union of all check types). Conflicting expected values are kept as sets
(e.g., multiple accepted CRC32 values for different ROM versions).
Each entry in the index tracks:
- `checks`: list of validation types (e.g., `["size", "crc32"]`)
- `sizes`: set of accepted exact sizes
- `min_size`, `max_size`: bounds when the code accepts a range
- `crc32`, `md5`, `sha1`, `sha256`: sets of accepted hash values
- `adler32`: set of accepted Adler-32 values
- `crypto_only`: non-reproducible checks (see below)
- `per_emulator`: per-core detail with source references
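The merge behavior (union of check types, sets of accepted values) can be sketched as follows; field names follow the description above, and the hash values in the test are dummies:

```python
def merge_validation(index, filename, profile_entry):
    """Merge one emulator profile's rules into a per-filename index.
    Checks are unioned; conflicting expected values accumulate as sets."""
    entry = index.setdefault(
        filename, {"checks": set(), "crc32": set(), "sizes": set()}
    )
    entry["checks"] |= set(profile_entry.get("validation", []))
    if "crc32" in profile_entry:
        entry["crc32"].add(profile_entry["crc32"])
    if "size" in profile_entry:
        entry["sizes"].add(profile_entry["size"])
```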
### Check categories
Validation checks fall into two categories:
**Reproducible** (`_HASH_CHECKS`): `crc32`, `md5`, `sha1`, `adler32`. These can be
computed from the file alone. `verify.py` calculates hashes and compares against
accepted values from the index.
**Non-reproducible** (`_CRYPTO_CHECKS`): `signature`, `crypto`. These require
console-specific cryptographic keys (e.g., RSA-2048 for 3DS, AES-128-CBC for certain
firmware). `verify.py` reports these as informational but cannot verify them without
the keys. Size checks still apply if combined with crypto.
### Size validation
Three forms:
- **Exact size**: `size: 524288` with `validation: [size]`. File must be exactly this many bytes.
- **Range**: `min_size: 40`, `max_size: 131076` with `validation: [size]`. File size must fall within bounds.
- **Informational**: `size: 524288` without `validation: [size]`. The size is documented but the emulator does not check it at runtime.
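The three forms can be sketched as one check function (illustrative names; the real logic lives in `validation.py`):

```python
def check_size(actual, rule):
    """Apply the size rules above: informational, exact set, or range."""
    if "size" not in rule.get("validation", []):
        return "info"  # size documented but not enforced at runtime
    if "sizes" in rule:
        return "ok" if actual in rule["sizes"] else "size_mismatch"
    lo = rule.get("min_size", 0)
    hi = rule.get("max_size", float("inf"))
    return "ok" if lo <= actual <= hi else "size_mismatch"
```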
### Complement to platform checks
Emulator validation runs after platform verification. When a file passes platform checks
(e.g., existence-mode OK) but fails emulator validation (e.g., wrong CRC32), the result
includes a `discrepancy` field:
```
file present (OK) but handy says size mismatch: got 256, accepted [512]
```
This catches cases where a file has the right name but wrong content, which existence-mode
platforms cannot detect.
## Severity Matrix
`compute_severity()` maps the combination of status, required flag, verification mode,
and HLE fallback to a severity level.
| Mode | Status | required | hle_fallback | Severity |
|------|--------|----------|--------------|----------|
| any | OK | any | any | OK |
| any | MISSING | any | true | INFO |
| existence | MISSING | true | false | WARNING |
| existence | MISSING | false | false | INFO |
| md5/sha1 | MISSING | true | false | CRITICAL |
| md5/sha1 | MISSING | false | false | WARNING |
| md5/sha1 | UNTESTED | any | false | WARNING |
**HLE fallback**: when an emulator profile marks a file with `hle_fallback: true`, the
core has a built-in high-level emulation path and functions without the file. Missing
files are downgraded to INFO regardless of platform mode or required status. The file
is still included in packs (better accuracy with the real BIOS), but its absence is not
actionable.
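Read as code, the matrix above becomes a short decision function. This is a sketch of the mapping, not the actual `compute_severity()` signature:

```python
def compute_severity(mode, status, required, hle_fallback):
    """Map (mode, status, required, hle_fallback) to a severity,
    following the matrix above."""
    if status == "OK":
        return "OK"
    if status == "MISSING" and hle_fallback:
        return "INFO"  # core has a built-in HLE path
    if mode == "existence":
        return "WARNING" if required else "INFO"
    if status == "MISSING":
        return "CRITICAL" if required else "WARNING"
    return "WARNING"  # UNTESTED on md5/sha1 platforms
```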
## File Resolution Chain
Before verification, each file entry is resolved to a local path by `resolve_local_file()`.
The function tries these steps in order, returning the first match:
| Step | Method | Returns | When it applies |
|------|--------|---------|-----------------|
| 0 | Path suffix exact | `exact` | `dest_hint` matches `by_path_suffix` index (regional variants with same filename, e.g., `GC/USA/IPL.bin` vs `GC/EUR/IPL.bin`) |
| 1 | SHA1 exact | `exact` | SHA1 present in the file entry and found in database |
| 2 | MD5 direct lookup | `md5_exact` | MD5 present, not a `zipped_file` entry, name matches (prevents cross-contamination from unrelated files sharing an MD5) |
| 3 | Name/alias existence | `exact` | No MD5 in entry; any file with matching name or alias exists. Prefers primary over `.variants/` |
| 4 | Name + md5_composite/MD5 | `exact` or `hash_mismatch` | Name matches, checks md5_composite for ZIPs and direct MD5 per candidate. Falls back to hash_mismatch if name matches but no hash does |
| 5 | ZIP contents index | `zip_exact` | `zipped_file` with MD5; searches inner ROM MD5 across all ZIPs when name-based resolution failed |
| 6 | MAME clone fallback | `mame_clone` | File was deduped; resolves via canonical set name (up to 3 levels deep) |
| 7 | Data directory scan | `data_dir` | Searches `data/` caches by exact path then case-insensitive basename walk |
| 8 | Agnostic fallback | `agnostic_fallback` | File entry marked `agnostic: true`; matches any file under the system path prefix within the size constraints |
If no step matches, the result is `(None, "not_found")`.
The `hash_mismatch` status at step 4 means a file with the right name exists but its hash
does not match. This still resolves to a local path (the file is present), but verification
will report it as UNTESTED with a reason string showing the expected vs actual hash prefix.
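Structurally, the chain is a first-match-wins sequence of resolver steps. The sketch below uses toy resolvers standing in for the real steps (SHA1 lookup, name match, and so on):

```python
def resolve_local_file(entry, resolvers):
    """Try each resolution step in order; return the first (path, method)
    hit, else (None, 'not_found')."""
    for step in resolvers:
        hit = step(entry)
        if hit is not None:
            return hit
    return (None, "not_found")


# Toy steps standing in for the real chain:
def by_sha1(entry):
    if entry.get("sha1") == "abc":
        return ("bios/Sony/PlayStation/scph5501.bin", "exact")
    return None


def by_name(entry):
    if entry.get("name"):
        return ("bios/Sony/PlayStation/.variants/scph5501.bin", "exact")
    return None
```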
## Discrepancy Detection
When platform verification passes but emulator validation fails, the file has a discrepancy.
This happens most often in existence-mode platforms where any file with the right name is
accepted.
### Variant search
`_find_best_variant()` searches for an alternative file in the repository that satisfies
both the platform MD5 requirement and emulator validation:
1. Look up all files with the same name in the `by_name` index.
2. Skip the current file (already known to fail validation).
3. For each candidate, check that its MD5 matches the platform expectation.
4. Run `check_file_validation()` against the candidate.
5. Return the first candidate that passes both checks.
The search covers files in `.variants/` (alternate hashes stored during deduplication).
If a better variant is found, the pack uses it instead of the primary file. If no variant
satisfies both constraints, the platform version is kept and the discrepancy is reported
in the verification output.
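The five steps reduce to a filtered first-match search. A minimal sketch, with illustrative inputs (`candidates` as `(path, md5)` pairs and validation as a callable):

```python
def find_best_variant(candidates, platform_md5, passes_validation):
    """Return the first candidate path that matches the platform MD5
    AND passes emulator validation, else None."""
    for path, md5 in candidates:
        if md5.lower() != platform_md5.lower():
            continue  # fails the platform expectation
        if passes_validation(path):
            return path  # satisfies both constraints
    return None
```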
### Practical example
A `scph5501.bin` file passes Batocera MD5 verification (hash matches upstream declaration)
but fails the emulator profile's size check because the profile was verified against a
different revision. `_find_best_variant` scans `.variants/scph5501.bin.*` for a file
that matches both the Batocera MD5 and the emulator's size expectation. If found, the
variant is used in the pack. If not, the Batocera-verified file is kept and the discrepancy
is logged.