
Adding a scraper

How to create or modify a scraper for fetching BIOS requirements from upstream platform sources.

Scraper architecture

Plugin discovery

Scrapers are discovered automatically at import time. The scripts/scraper/__init__.py module uses pkgutil.iter_modules to scan for files matching *_scraper.py in the scraper directory. Each module must export:

  • PLATFORM_NAME: str - the platform identifier (matches _registry.yml)
  • Scraper: class - a subclass of BaseScraper

No registration code is needed. Drop a file, export the two names, and it works.

# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes

BaseScraper ABC

BaseScraper (scripts/scraper/base_scraper.py) provides the foundation:

BaseScraper
  __init__(url: str)
  _fetch_raw() -> str                # HTTP GET, cached, 50 MB limit
  fetch_requirements() -> list       # abstract: parse upstream data
  validate_format(raw_data) -> bool  # abstract: detect format changes
  compare_with_config(config) -> ChangeSet  # diff against existing YAML
  test_connection() -> bool          # reachability check

_fetch_raw() handles HTTP with urllib.request, sets a retrobios-scraper/1.0 user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the instance after the first call.

BiosRequirement

A dataclass representing a single BIOS file entry:

@dataclass
class BiosRequirement:
    name: str                        # filename
    system: str                      # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""            # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None   # ROM name inside a ZIP
    native_id: str | None = None     # original system name

ChangeSet

Returned by compare_with_config(). Contains:

  • added: list[BiosRequirement] - new files not in the existing config
  • removed: list[BiosRequirement] - files present in config but gone upstream
  • modified: list[tuple[BiosRequirement, BiosRequirement]] - hash changes
  • has_changes: bool - True if any of the above are non-empty
  • summary() -> str - human-readable summary (e.g. +3 added, ~1 modified)
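
A simplified illustration of the diff, keyed by filename only (the real compare_with_config works on full BiosRequirement entries):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Entry:
    name: str
    sha1: str | None = None

@dataclass
class ChangeSet:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    modified: list = field(default_factory=list)  # (old, new) pairs

    @property
    def has_changes(self) -> bool:
        return bool(self.added or self.removed or self.modified)

    def summary(self) -> str:
        return (f"+{len(self.added)} added, -{len(self.removed)} removed, "
                f"~{len(self.modified)} modified")

def diff(upstream: list[Entry], existing: list[Entry]) -> ChangeSet:
    """Compare upstream requirements against the existing config, keyed by name."""
    old = {e.name: e for e in existing}
    new = {e.name: e for e in upstream}
    cs = ChangeSet()
    cs.added = [new[n] for n in new.keys() - old.keys()]
    cs.removed = [old[n] for n in old.keys() - new.keys()]
    cs.modified = [(old[n], new[n]) for n in old.keys() & new.keys()
                   if old[n].sha1 != new[n].sha1]
    return cs
```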

scraper_cli

scraper_cli(scraper_class, description) provides a shared CLI with three modes:

| Flag | Behavior |
| --- | --- |
| --dry-run | Fetch and print a summary grouped by system |
| --json | Output all requirements as JSON |
| --output FILE | Write platform YAML to FILE |

When --output targets an existing file, the CLI preserves keys not generated by the scraper (e.g. data_directories, manual additions). Only the systems section is replaced. If the scraper defines generate_platform_yaml(), that method is used instead of the generic YAML builder.
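
The key-preservation rule amounts to a dict-level merge, sketched here with a hypothetical helper name (the real CLI operates on the parsed platform YAML):

```python
def merge_platform_yaml(existing: dict, generated_systems: dict) -> dict:
    """Replace only the 'systems' section, keeping hand-maintained keys intact."""
    merged = dict(existing)  # preserves data_directories, manual additions, etc.
    merged["systems"] = generated_systems
    return merged
```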

Helper functions

Two additional functions in base_scraper.py:

  • fetch_github_latest_version(repo) - fetches the latest release tag via GitHub API.
  • fetch_github_latest_tag(repo, prefix) - fetches the most recent tag matching an optional prefix.
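
Under the hood, the release helper presumably hits the GitHub releases API; a sketch under that assumption (illustrative names, not the actual implementation):

```python
import json
import urllib.request

def parse_latest_tag(payload: str) -> str:
    """Extract the tag name from a /releases/latest JSON payload."""
    return json.loads(payload)["tag_name"]

def github_latest_version(repo: str, timeout: int = 30) -> str:
    """Return the latest release tag for an 'owner/name' repo."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    req = urllib.request.Request(url, headers={"User-Agent": "retrobios-scraper/1.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_latest_tag(resp.read().decode("utf-8"))
```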

Creating a BIOS scraper

Minimal example

Based on the RomM scraper pattern (JSON source, flat structure):

"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"

SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}


class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")

        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0


def main():
    try:
        from .base_scraper import scraper_cli
    except ImportError:
        from base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")


if __name__ == "__main__":
    main()

Parsing different upstream formats

Each platform stores its BIOS requirements differently. The scraper's job is to normalize them into BiosRequirement entries.

| Format | Example | Parsing approach |
| --- | --- | --- |
| JSON | RomM known_bios_files.json | json.loads(), iterate keys |
| XML | Recalbox es_bios.xml | xml.etree.ElementTree, xpath or iter |
| clrmamepro DAT | RetroArch System.dat | Use dat_parser module (see below) |
| Python dict | Batocera batocera-systems | ast.literal_eval or regex extraction |
| Bash script | EmuDeck checkBIOS.sh | Line-by-line regex parsing |
| C# source | BizHawk FirmwareDatabase.cs | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use mame_parser or fbneo_parser (see below) |
| JSON (GitHub API) | RetroDECK component manifests | json.loads() per manifest file |

System ID mapping

Every scraper maintains a SLUG_MAP (or equivalent) that translates the platform's native system identifiers to retrobios system IDs. The retrobios system ID format is manufacturer-console in lowercase with hyphens (e.g. sony-playstation, sega-mega-drive, nintendo-gba).

When a native slug has no mapping, print a warning to stderr. This surfaces new systems added upstream that need to be mapped.

System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent, and consult the SYSTEM_ALIASES dict in common.py, which maps common variations to canonical IDs.

Hash normalization

  • Normalize all hashes to lowercase hex strings.
  • Handle missing hashes gracefully (set to None, not empty string).
  • Some platforms provide multiple hash types per entry. Populate whichever fields are available.
  • Batocera uses 29-character truncated MD5 hashes in some entries. The resolution layer handles prefix matching, but the scraper should store the hash as-is.
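
These rules fit in a small helper (an illustrative sketch, not a function from the codebase):

```python
from __future__ import annotations

def normalize_hash(value) -> str | None:
    """Lowercase a hex hash string; map empty or missing values to None.

    Truncated hashes (e.g. Batocera's 29-character MD5 prefixes) are
    stored as-is; prefix matching happens in the resolution layer.
    """
    if value is None:
        return None
    value = str(value).strip().lower()
    return value or None
```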

Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in _shared.yml (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.

Creating a target scraper

Target scrapers determine which emulator cores are available on each hardware target. They live in scripts/scraper/targets/ and follow the same auto-discovery pattern (*_targets_scraper.py).

BaseTargetScraper ABC

BaseTargetScraper
  __init__(url: str)
  fetch_targets() -> dict    # abstract: return target data
  write_output(data, path)   # write YAML to disk

Output format

fetch_targets() must return a dict with this structure:

{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}

The targets dict maps target names to their available cores. Core names must match the names used in the cores: fields of emulator profiles for the target filtering pipeline to work correctly.

Overrides

platforms/targets/_overrides.yml provides post-scrape adjustments:

  • aliases: map alternate target names to canonical ones (e.g. arm64 -> aarch64)
  • add_cores: cores present on a target but not detected by the scraper
  • remove_cores: cores detected by the scraper but not actually functional

Overrides are applied by load_target_config() in common.py after loading the scraped data. The scraper itself does not need to handle overrides.
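
Roughly what applying overrides looks like (a sketch assuming the YAML shapes above; the actual load_target_config() may differ):

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Apply aliases, add_cores, and remove_cores to scraped target data."""
    result = {}
    # Rename targets to their canonical names (e.g. arm64 -> aarch64).
    aliases = overrides.get("aliases", {})
    for name, info in targets.items():
        result[aliases.get(name, name)] = dict(info, cores=list(info.get("cores", [])))
    # Add cores the scraper missed.
    for name, cores in overrides.get("add_cores", {}).items():
        entry = result.setdefault(name, {"cores": []})
        entry["cores"] = sorted(set(entry["cores"]) | set(cores))
    # Drop cores that are detected but not actually functional.
    for name, cores in overrides.get("remove_cores", {}).items():
        if name in result:
            drop = set(cores)
            result[name]["cores"] = [c for c in result[name]["cores"] if c not in drop]
    return result
```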

Module contract

from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"

class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...

Register the target scraper in _registry.yml:

  myplatform:
    target_scraper: myplatform_targets
    target_source: https://...

Existing target scrapers

| Scraper | Source | Approach |
| --- | --- | --- |
| retroarch_targets | libretro buildbot nightly | Scrape directory listings for each target arch |
| batocera_targets | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| emudeck_targets | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| retropie_targets | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |

Parser modules

Shared parsers in scripts/scraper/ handle formats used by multiple scrapers or formats complex enough to warrant dedicated parsing logic.

dat_parser

Parses clrmamepro DAT format as used in RetroArch's System.dat:

game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)

Produces DatRom dataclass instances with name, size, crc32, md5, sha1, and system fields. The libretro_scraper uses this parser.
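
For flavour, a minimal regex-based reading of a single rom ( ... ) clause (illustrative; the real dat_parser handles nesting and quoting more robustly, and also attaches the system field, omitted here):

```python
from __future__ import annotations
import re
from dataclasses import dataclass

# Key/value pairs inside a clrmamepro rom ( ... ) clause: values are
# either bare tokens or quoted strings.
PAIR_RE = re.compile(r'(\w+)\s+("[^"]*"|\S+)')

@dataclass
class DatRom:
    name: str
    size: int | None = None
    crc32: str | None = None
    md5: str | None = None
    sha1: str | None = None

def parse_rom_line(line: str) -> DatRom:
    """Parse one 'rom ( name file.bin size 12345 crc ABCD1234 ... )' clause."""
    inner = line[line.index("(") + 1 : line.rindex(")")]
    fields = {k: v.strip('"') for k, v in PAIR_RE.findall(inner)}
    return DatRom(
        name=fields["name"],
        size=int(fields["size"]) if "size" in fields else None,
        crc32=fields.get("crc", "").lower() or None,
        md5=fields.get("md5", "").lower() or None,
        sha1=fields.get("sha1", "").lower() or None,
    )
```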

mame_parser

Parses MAME C source files to extract BIOS root sets. Handles:

  • Machine declaration macros: GAME, SYST, COMP, CONS
  • MACHINE_IS_BIOS_ROOT flag detection
  • ROM_START/ROM_END blocks
  • ROM_LOAD variants and ROM_REGION declarations
  • ROM_SYSTEM_BIOS entries
  • NO_DUMP filtering and BAD_DUMP flagging

Used by mame_hash_scraper to auto-fetch BIOS hashes from MAME driver sources for each tagged MAME version.
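
To give a flavour of the pattern matching involved, a toy parser for a single ROM_LOAD line (the real mame_parser handles the full macro grammar, regions, and BIOS flags):

```python
from __future__ import annotations
import re

# ROM_LOAD("name", offset, length, flags) -- flags carry CRC(...) and SHA1(...).
ROM_LOAD_RE = re.compile(
    r'ROM_LOAD\(\s*"(?P<name>[^"]+)"\s*,\s*0x[0-9a-fA-F]+\s*,\s*(?P<size>0x[0-9a-fA-F]+)'
)
CRC_RE = re.compile(r'CRC\(([0-9a-fA-F]{8})\)')
SHA1_RE = re.compile(r'SHA1\(([0-9a-fA-F]{40})\)')

def parse_rom_load(line: str) -> dict | None:
    """Extract name, size, CRC32, and SHA1 from a ROM_LOAD macro invocation."""
    m = ROM_LOAD_RE.search(line)
    if not m:
        return None
    crc = CRC_RE.search(line)
    sha1 = SHA1_RE.search(line)
    return {
        "name": m.group("name"),
        "size": int(m.group("size"), 16),
        "crc32": crc.group(1).lower() if crc else None,
        "sha1": sha1.group(1).lower() if sha1 else None,
    }
```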

fbneo_parser

Parses FBNeo C source files:

  • BurnRomInfo structs (static ROM arrays with name, size, CRC)
  • BurnDriver structs (driver registration with BDF_BOARDROM flag)
  • BIOS set identification via the boardrom flag

Used by fbneo_hash_scraper to extract BIOS ROM definitions.

_hash_merge

Text-based YAML patching that merges fetched hash data into emulator profiles while preserving formatting. Two strategies:

  • MAME: updates bios_zip entries with contents lists (name, size, CRC32)
  • FBNeo: updates individual ROM entries grouped by archive field

The merge preserves fields the hash data does not generate (system, note, required) and leaves entries not present in the hash data untouched. Uses text-level YAML manipulation rather than load-dump to maintain human-readable formatting.

Testing

Development workflow

  1. Start with --dry-run to preview the scrape before writing any output:

    python -m scripts.scraper.myplatform_scraper --dry-run
    
  2. Check JSON output for data quality:

    python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
    
  3. Compare with existing YAML if updating a scraper:

    python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
    diff platforms/myplatform.yml /tmp/test.yml
    
  4. Run verification after generating:

    python scripts/verify.py --platform myplatform
    python scripts/verify.py --platform myplatform --verbose
    
  5. Run the full pipeline before committing:

    python scripts/pipeline.py --offline
    

Round-trip testing

If an exporter exists for the platform, validate the scrape-export-compare cycle:

# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml

# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json

# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
     /tmp/exported.json

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| Unmapped slug warnings | New system added upstream | Add mapping to SLUG_MAP |
| Empty requirements list | Upstream format changed | Check validate_format(), update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | _fetch_raw() has 30s timeout; check URL |
| Response exceeds 50 MB | Upstream file grew | Investigate; may need chunked parsing |
| validate_format fails | Upstream restructured | Update both validate_format and fetch_requirements |

E2E tests

The project's test suite (tests/test_e2e.py) covers scraper integration at the pipeline level. When adding a new scraper, verify that the full pipeline passes:

python -m unittest tests.test_e2e
python scripts/pipeline.py --offline

Both must pass before the scraper is considered complete.