# Adding a scraper

How to create or modify a scraper for fetching BIOS requirements from upstream platform sources.

## Scraper architecture

### Plugin discovery

Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py` module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in the scraper directory. Each module must export:

- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`

No registration code is needed. Drop a file, export the two names, and it works.

```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```

### BaseScraper ABC

`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:

```
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                       # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list              # abstract: parse upstream data
    validate_format(raw_data) -> bool         # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool                 # reachability check
```

`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0` user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the instance after the first call.
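The chunked-read-with-hard-limit pattern behind `_fetch_raw()` can be sketched against any file-like stream. This is an illustrative helper, not the actual implementation; the name `read_limited` and its defaults are assumptions:

```python
import io

MAX_BYTES = 50 * 1024 * 1024   # 50 MB hard limit
CHUNK_SIZE = 64 * 1024         # 64 KB per read

def read_limited(stream, max_bytes: int = MAX_BYTES, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read a file-like object in chunks, aborting once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        total += len(chunk)
        if total > max_bytes:
            # Fail fast instead of buffering an unbounded response in memory
            raise ValueError(f"Response exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)

# Works the same way on an HTTP response object or an in-memory stream:
data = read_limited(io.BytesIO(b"x" * 1000), max_bytes=2000)  # returns 1000 bytes
```

Reading in fixed-size chunks keeps peak memory proportional to the limit rather than to whatever the server decides to send.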
### BiosRequirement

A dataclass representing a single BIOS file entry:

```python
@dataclass
class BiosRequirement:
    name: str                       # filename
    system: str                     # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""           # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None  # ROM name inside a ZIP
    native_id: str | None = None    # original system name
```

### ChangeSet

Returned by `compare_with_config()`. Contains:

- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)

### scraper_cli

`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:

| Flag | Behavior |
|------|----------|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |

When `--output` targets an existing file, the CLI preserves keys not generated by the scraper (e.g. `data_directories`, manual additions). Only the `systems` section is replaced. If the scraper defines `generate_platform_yaml()`, that method is used instead of the generic YAML builder.

### Helper functions

Two additional functions in `base_scraper.py`:

- `fetch_github_latest_version(repo)` - fetches the latest release tag via the GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching an optional prefix.
## Creating a BIOS scraper

### Minimal example

Based on the RomM scraper pattern (JSON source, flat structure):

```python
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"
SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}


class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")
        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0


def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")


if __name__ == "__main__":
    main()
```

### Parsing different upstream formats

Each platform stores its BIOS requirements differently. The scraper's job is to normalize them into `BiosRequirement` entries.
| Format | Example | Parsing approach |
|--------|---------|------------------|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |

### System ID mapping

Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's native system identifiers to retrobios system IDs. The canonical retrobios format is `manufacturer-console` in lowercase with hyphens (e.g. `sony-playstation`, `sega-mega-drive`, `nintendo-gba`).

When a native slug has no mapping, print a warning to stderr. This surfaces new systems added upstream that need to be mapped.

System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in `common.py` maps common variations to canonical IDs.

### Hash normalization

- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution layer handles prefix matching, but the scraper should store the hash as-is.
Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in `_shared.yml` (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.

## Creating a target scraper

Target scrapers determine which emulator cores are available on each hardware target. They live in `scripts/scraper/targets/` and follow the same auto-discovery pattern (`*_targets_scraper.py`).

### BaseTargetScraper ABC

```
BaseTargetScraper
    __init__(url: str)
    fetch_targets() -> dict      # abstract: return target data
    write_output(data, path)     # write YAML to disk
```

### Output format

`fetch_targets()` must return a dict with this structure:

```python
{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}
```

The `targets` dict maps target names to their available cores. Core names must match the names used in emulator profile `cores:` fields for the target filtering pipeline to work correctly.

### Overrides

`platforms/targets/_overrides.yml` provides post-scrape adjustments:

- **aliases**: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- **add_cores**: cores present on a target but not detected by the scraper
- **remove_cores**: cores detected by the scraper but not actually functional

Overrides are applied by `load_target_config()` in `common.py` after loading the scraped data. The scraper itself does not need to handle overrides.
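How the three override kinds combine can be sketched as a pure function over the scraped `targets` dict. This is an illustrative sketch only; the real logic lives in `load_target_config()` in `common.py` and may differ in structure and edge-case handling:

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Hypothetical override pass: rename via aliases, then add/remove cores."""
    aliases = overrides.get("aliases", {})
    result = {}
    for name, data in targets.items():
        canonical = aliases.get(name, name)          # arm64 -> aarch64
        cores = set(data["cores"])
        cores |= set(overrides.get("add_cores", {}).get(canonical, []))
        cores -= set(overrides.get("remove_cores", {}).get(canonical, []))
        result[canonical] = {**data, "cores": sorted(cores)}
    return result

targets = {"arm64": {"architecture": "aarch64", "cores": ["mgba", "broken_core"]}}
overrides = {
    "aliases": {"arm64": "aarch64"},
    "add_cores": {"aarch64": ["pcsx_rearmed"]},
    "remove_cores": {"aarch64": ["broken_core"]},
}
print(apply_overrides(targets, overrides))
# {'aarch64': {'architecture': 'aarch64', 'cores': ['mgba', 'pcsx_rearmed']}}
```

Note that aliasing runs first in this sketch, so `add_cores`/`remove_cores` entries are looked up under the canonical name.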
### Module contract

```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"


class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...
```

Register the target scraper in `_registry.yml`:

```yaml
myplatform:
  target_scraper: myplatform_targets
  target_source: https://...
```

### Existing target scrapers

| Scraper | Source | Approach |
|---------|--------|----------|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |

## Parser modules

Shared parsers in `scripts/scraper/` handle formats used by multiple scrapers, or formats complex enough to warrant dedicated parsing logic.

### dat_parser

Parses the clrmamepro DAT format as used in RetroArch's `System.dat`:

```
game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```

Produces `DatRom` dataclass instances with `name`, `size`, `crc32`, `md5`, `sha1`, and `system` fields. The `libretro_scraper` uses this parser.

### mame_parser

Parses MAME C source files to extract BIOS root sets. Handles:

- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging

Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources for each tagged MAME version.
### fbneo_parser

Parses FBNeo C source files:

- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag

Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.

### _hash_merge

Text-based YAML patching that merges fetched hash data into emulator profiles while preserving formatting. Two strategies:

- **MAME**: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- **FBNeo**: updates individual ROM entries grouped by `archive` field

The merge preserves fields the hash data does not generate (system, note, required) and leaves entries not present in the hash data untouched. It uses text-level YAML manipulation rather than load-dump to maintain human-readable formatting.

## Testing

### Development workflow

1. **Start with `--dry-run`** to preview before writing output:

   ```bash
   python -m scripts.scraper.myplatform_scraper --dry-run
   ```

2. **Check JSON output** for data quality:

   ```bash
   python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
   ```

3. **Compare with existing YAML** if updating a scraper:

   ```bash
   python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
   diff platforms/myplatform.yml /tmp/test.yml
   ```

4. **Run verification** after generating:

   ```bash
   python scripts/verify.py --platform myplatform
   python scripts/verify.py --platform myplatform --verbose
   ```

5. **Run the full pipeline** before committing:

   ```bash
   python scripts/pipeline.py --offline
   ```

### Round-trip testing

If an exporter exists for the platform, validate the scrape-export-compare cycle:

```bash
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml

# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json

# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
     /tmp/exported.json
```

### Common issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | `_fetch_raw()` has 30s timeout; check URL |
| `Response exceeds 50 MB` | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |

### E2E tests

The project's test suite (`tests/test_e2e.py`) covers scraper integration at the pipeline level. When adding a new scraper, verify that the full pipeline passes:

```bash
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```

Both must pass before the scraper is considered complete.
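Alongside the pipeline-level e2e suite, a focused unit test for a scraper's `validate_format()` catches upstream format drift early. This is a hypothetical test module; the `validate_format` body here mirrors the minimal example earlier in this page rather than importing any real scraper:

```python
import json
import unittest

def validate_format(raw_data: str) -> bool:
    """Stand-in for Scraper.validate_format from the minimal example."""
    try:
        data = json.loads(raw_data)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, list) and len(data) > 0

class ValidateFormatTest(unittest.TestCase):
    def test_accepts_nonempty_list(self):
        self.assertTrue(validate_format('[{"platform": "psx"}]'))

    def test_rejects_wrong_shapes(self):
        self.assertFalse(validate_format('{}'))        # dict, not list
        self.assertFalse(validate_format('[]'))        # empty list
        self.assertFalse(validate_format('not json'))  # unparseable
```

Run it with `python -m unittest` pointed at the test module, the same way the e2e suite is invoked above.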