
Adding a scraper

How to create or modify a scraper for fetching BIOS requirements from upstream platform sources.

Scraper architecture

Plugin discovery

Scrapers are discovered automatically at import time. The scripts/scraper/__init__.py module uses pkgutil.iter_modules to scan for files matching *_scraper.py in the scraper directory. Each module must export:

  • PLATFORM_NAME: str - the platform identifier (matches _registry.yml)
  • Scraper: class - a subclass of BaseScraper

No registration code is needed. Drop a file, export the two names, and it works.

# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes

BaseScraper ABC

BaseScraper (scripts/scraper/base_scraper.py) provides the foundation:

BaseScraper
  __init__(url: str)
  _fetch_raw() -> str                # HTTP GET, cached, 50 MB limit
  fetch_requirements() -> list       # abstract: parse upstream data
  validate_format(raw_data) -> bool  # abstract: detect format changes
  compare_with_config(config) -> ChangeSet  # diff against existing YAML
  test_connection() -> bool          # reachability check

_fetch_raw() handles HTTP with urllib.request, sets a retrobios-scraper/1.0 user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the instance after the first call.

BiosRequirement

A dataclass representing a single BIOS file entry:

@dataclass
class BiosRequirement:
    name: str                        # filename
    system: str                      # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""            # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None   # ROM name inside a ZIP
    native_id: str | None = None     # original system name

ChangeSet

Returned by compare_with_config(). Contains:

  • added: list[BiosRequirement] - new files not in the existing config
  • removed: list[BiosRequirement] - files present in config but gone upstream
  • modified: list[tuple[BiosRequirement, BiosRequirement]] - hash changes
  • has_changes: bool - True if any of the above are non-empty
  • summary() -> str - human-readable summary (e.g. +3 added, ~1 modified)
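
A simplified illustration of the diff, keyed by filename only (the real compare_with_config works on full BiosRequirement entries):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Entry:
    name: str
    sha1: str | None = None

@dataclass
class ChangeSet:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    modified: list = field(default_factory=list)  # (old, new) pairs

    @property
    def has_changes(self) -> bool:
        return bool(self.added or self.removed or self.modified)

    def summary(self) -> str:
        return (f"+{len(self.added)} added, -{len(self.removed)} removed, "
                f"~{len(self.modified)} modified")

def diff(upstream: list[Entry], existing: list[Entry]) -> ChangeSet:
    """Compare upstream requirements against the existing config, keyed by name."""
    old = {e.name: e for e in existing}
    new = {e.name: e for e in upstream}
    cs = ChangeSet()
    cs.added = [new[n] for n in new.keys() - old.keys()]
    cs.removed = [old[n] for n in old.keys() - new.keys()]
    cs.modified = [(old[n], new[n]) for n in old.keys() & new.keys()
                   if old[n].sha1 != new[n].sha1]
    return cs
```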

scraper_cli

scraper_cli(scraper_class, description) provides a shared CLI with three modes:

| Flag | Behavior |
| --- | --- |
| --dry-run | Fetch and print a summary grouped by system |
| --json | Output all requirements as JSON |
| --output FILE | Write platform YAML to FILE |

When --output targets an existing file, the CLI preserves keys not generated by the scraper (e.g. data_directories, manual additions). Only the systems section is replaced. If the scraper defines generate_platform_yaml(), that method is used instead of the generic YAML builder.
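
The key-preservation rule amounts to a dict-level merge, sketched here with a hypothetical helper name (the real CLI operates on the parsed platform YAML):

```python
def merge_platform_yaml(existing: dict, generated_systems: dict) -> dict:
    """Replace only the 'systems' section, keeping hand-maintained keys intact."""
    merged = dict(existing)  # preserves data_directories, manual additions, etc.
    merged["systems"] = generated_systems
    return merged
```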

Helper functions

Two additional functions in base_scraper.py:

  • fetch_github_latest_version(repo) - fetches the latest release tag via GitHub API.
  • fetch_github_latest_tag(repo, prefix) - fetches the most recent tag matching an optional prefix.
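
Under the hood, the release helper presumably hits the GitHub releases API; a sketch under that assumption (illustrative names, not the actual implementation):

```python
import json
import urllib.request

def parse_latest_tag(payload: str) -> str:
    """Extract the tag name from a /releases/latest JSON payload."""
    return json.loads(payload)["tag_name"]

def github_latest_version(repo: str, timeout: int = 30) -> str:
    """Return the latest release tag for an 'owner/name' repo."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    req = urllib.request.Request(url, headers={"User-Agent": "retrobios-scraper/1.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_latest_tag(resp.read().decode("utf-8"))
```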

Creating a BIOS scraper

Minimal example

Based on the RomM scraper pattern (JSON source, flat structure):

"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"

SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}


class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")

        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0


def main():
    try:
        from .base_scraper import scraper_cli
    except ImportError:
        from base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")


if __name__ == "__main__":
    main()

Parsing different upstream formats

Each platform stores its BIOS requirements differently. The scraper's job is to normalize them into BiosRequirement entries.

| Format | Example | Parsing approach |
| --- | --- | --- |
| JSON | RomM known_bios_files.json | json.loads(), iterate keys |
| XML | Recalbox es_bios.xml | xml.etree.ElementTree, xpath or iter |
| clrmamepro DAT | RetroArch System.dat | Use dat_parser module (see below) |
| Python dict | Batocera batocera-systems | ast.literal_eval or regex extraction |
| Bash script | EmuDeck checkBIOS.sh | Line-by-line regex parsing |
| C# source | BizHawk FirmwareDatabase.cs | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use mame_parser or fbneo_parser (see below) |
| JSON (GitHub API) | RetroDECK component manifests | json.loads() per manifest file |

System ID mapping

Every scraper maintains a SLUG_MAP (or equivalent) that translates the platform's native system identifiers to retrobios system IDs. The retrobios system ID format is manufacturer-console in lowercase with hyphens (e.g. sony-playstation, sega-mega-drive, nintendo-gba).

When a native slug has no mapping, print a warning to stderr. This surfaces new systems added upstream that need to be mapped.

System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent, and consult the SYSTEM_ALIASES dict in common.py, which maps common variations to canonical IDs.

Hash normalization

  • Normalize all hashes to lowercase hex strings.
  • Handle missing hashes gracefully (set to None, not empty string).
  • Some platforms provide multiple hash types per entry. Populate whichever fields are available.
  • Batocera uses 29-character truncated MD5 hashes in some entries. The resolution layer handles prefix matching, but the scraper should store the hash as-is.
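
These rules fit in a small helper (an illustrative sketch, not a function from the codebase):

```python
from __future__ import annotations

def normalize_hash(value) -> str | None:
    """Lowercase a hex hash string; map empty or missing values to None.

    Truncated hashes (e.g. Batocera's 29-character MD5 prefixes) are
    stored as-is; prefix matching happens in the resolution layer.
    """
    if value is None:
        return None
    value = str(value).strip().lower()
    return value or None
```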

Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in _shared.yml (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.

Creating a target scraper

Target scrapers determine which emulator cores are available on each hardware target. They live in scripts/scraper/targets/ and follow the same auto-discovery pattern (*_targets_scraper.py).

BaseTargetScraper ABC

BaseTargetScraper
  __init__(url: str)
  fetch_targets() -> dict    # abstract: return target data
  write_output(data, path)   # write YAML to disk

Output format

fetch_targets() must return a dict with this structure:

{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}

The targets dict maps target names to their available cores. Core names must match the names used in the cores: fields of emulator profiles for the target filtering pipeline to work correctly.

Overrides

platforms/targets/_overrides.yml provides post-scrape adjustments:

  • aliases: map alternate target names to canonical ones (e.g. arm64 -> aarch64)
  • add_cores: cores present on a target but not detected by the scraper
  • remove_cores: cores detected by the scraper but not actually functional

Overrides are applied by load_target_config() in common.py after loading the scraped data. The scraper itself does not need to handle overrides.
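
Roughly what applying overrides looks like (a sketch assuming the YAML shapes above; the actual load_target_config() may differ):

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Apply aliases, add_cores, and remove_cores to scraped target data."""
    result = {}
    # Rename targets to their canonical names (e.g. arm64 -> aarch64).
    aliases = overrides.get("aliases", {})
    for name, info in targets.items():
        result[aliases.get(name, name)] = dict(info, cores=list(info.get("cores", [])))
    # Add cores the scraper missed.
    for name, cores in overrides.get("add_cores", {}).items():
        entry = result.setdefault(name, {"cores": []})
        entry["cores"] = sorted(set(entry["cores"]) | set(cores))
    # Drop cores that are detected but not actually functional.
    for name, cores in overrides.get("remove_cores", {}).items():
        if name in result:
            drop = set(cores)
            result[name]["cores"] = [c for c in result[name]["cores"] if c not in drop]
    return result
```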

Module contract

from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"

class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...

Register the target scraper in _registry.yml:

  myplatform:
    target_scraper: myplatform_targets
    target_source: https://...

Existing target scrapers

| Scraper | Source | Approach |
| --- | --- | --- |
| retroarch_targets | libretro buildbot nightly | Scrape directory listings for each target arch |
| batocera_targets | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| emudeck_targets | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| retropie_targets | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |

Parser modules

Shared parsers in scripts/scraper/ handle formats used by multiple scrapers or formats complex enough to warrant dedicated parsing logic.

dat_parser

Parses clrmamepro DAT format as used in RetroArch's System.dat:

game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)

Produces DatRom dataclass instances with name, size, crc32, md5, sha1, and system fields. The libretro_scraper uses this parser.
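
For flavour, a minimal regex-based reading of a single rom ( ... ) clause (illustrative; the real dat_parser handles nesting and quoting more robustly, and also attaches the system field, omitted here):

```python
from __future__ import annotations
import re
from dataclasses import dataclass

# Key/value pairs inside a clrmamepro rom ( ... ) clause: values are
# either bare tokens or quoted strings.
PAIR_RE = re.compile(r'(\w+)\s+("[^"]*"|\S+)')

@dataclass
class DatRom:
    name: str
    size: int | None = None
    crc32: str | None = None
    md5: str | None = None
    sha1: str | None = None

def parse_rom_line(line: str) -> DatRom:
    """Parse one 'rom ( name file.bin size 12345 crc ABCD1234 ... )' clause."""
    inner = line[line.index("(") + 1 : line.rindex(")")]
    fields = {k: v.strip('"') for k, v in PAIR_RE.findall(inner)}
    return DatRom(
        name=fields["name"],
        size=int(fields["size"]) if "size" in fields else None,
        crc32=fields.get("crc", "").lower() or None,
        md5=fields.get("md5", "").lower() or None,
        sha1=fields.get("sha1", "").lower() or None,
    )
```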

mame_parser

Parses MAME C source files to extract BIOS root sets. Handles:

  • Machine declaration macros: GAME, SYST, COMP, CONS
  • MACHINE_IS_BIOS_ROOT flag detection
  • ROM_START/ROM_END blocks
  • ROM_LOAD variants and ROM_REGION declarations
  • ROM_SYSTEM_BIOS entries
  • NO_DUMP filtering and BAD_DUMP flagging

Used by mame_hash_scraper to auto-fetch BIOS hashes from MAME driver sources for each tagged MAME version.
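
To give a flavour of the pattern matching involved, a toy parser for a single ROM_LOAD line (the real mame_parser handles the full macro grammar, regions, and BIOS flags):

```python
from __future__ import annotations
import re

# ROM_LOAD("name", offset, length, flags) -- flags carry CRC(...) and SHA1(...).
ROM_LOAD_RE = re.compile(
    r'ROM_LOAD\(\s*"(?P<name>[^"]+)"\s*,\s*0x[0-9a-fA-F]+\s*,\s*(?P<size>0x[0-9a-fA-F]+)'
)
CRC_RE = re.compile(r'CRC\(([0-9a-fA-F]{8})\)')
SHA1_RE = re.compile(r'SHA1\(([0-9a-fA-F]{40})\)')

def parse_rom_load(line: str) -> dict | None:
    """Extract name, size, CRC32, and SHA1 from a ROM_LOAD macro invocation."""
    m = ROM_LOAD_RE.search(line)
    if not m:
        return None
    crc = CRC_RE.search(line)
    sha1 = SHA1_RE.search(line)
    return {
        "name": m.group("name"),
        "size": int(m.group("size"), 16),
        "crc32": crc.group(1).lower() if crc else None,
        "sha1": sha1.group(1).lower() if sha1 else None,
    }
```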

fbneo_parser

Parses FBNeo C source files:

  • BurnRomInfo structs (static ROM arrays with name, size, CRC)
  • BurnDriver structs (driver registration with BDF_BOARDROM flag)
  • BIOS set identification via the boardrom flag

Used by fbneo_hash_scraper to extract BIOS ROM definitions.

_hash_merge

Text-based YAML patching that merges fetched hash data into emulator profiles while preserving formatting. Two strategies:

  • MAME: updates bios_zip entries with contents lists (name, size, CRC32)
  • FBNeo: updates individual ROM entries grouped by archive field

The merge preserves fields the hash data does not generate (system, note, required) and leaves entries not present in the hash data untouched. Uses text-level YAML manipulation rather than load-dump to maintain human-readable formatting.

Testing

Development workflow

  1. Start with --dry-run to preview the scrape before writing any output:

    python -m scripts.scraper.myplatform_scraper --dry-run
    
  2. Check JSON output for data quality:

    python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
    
  3. Compare with existing YAML if updating a scraper:

    python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
    diff platforms/myplatform.yml /tmp/test.yml
    
  4. Run verification after generating:

    python scripts/verify.py --platform myplatform
    python scripts/verify.py --platform myplatform --verbose
    
  5. Run the full pipeline before committing:

    python scripts/pipeline.py --offline
    

Round-trip testing

If an exporter exists for the platform, validate the scrape-export-compare cycle:

# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml

# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json

# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
     /tmp/exported.json

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| Unmapped slug warnings | New system added upstream | Add mapping to SLUG_MAP |
| Empty requirements list | Upstream format changed | Check validate_format(), update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | _fetch_raw() has 30s timeout; check URL |
| Response exceeds 50 MB | Upstream file grew | Investigate; may need chunked parsing |
| validate_format fails | Upstream restructured | Update both validate_format and fetch_requirements |

E2E tests

The project's test suite (tests/test_e2e.py) covers scraper integration at the pipeline level. When adding a new scraper, verify that the full pipeline passes:

python -m unittest tests.test_e2e
python scripts/pipeline.py --offline

Both must pass before the scraper is considered complete.