# Adding a scraper

How to create or modify a scraper for fetching BIOS requirements from upstream platform sources.
## Scraper architecture

### Plugin discovery

Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py`
module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in
the scraper directory. Each module must export:

- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`

No registration code is needed. Drop a file, export the two names, and it works.
```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```
### BaseScraper ABC

`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:

```text
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                       # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list              # abstract: parse upstream data
    validate_format(raw_data) -> bool         # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool                 # reachability check
```
`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0`
User-Agent, enforces a 30-second timeout, and reads the response in 64 KB chunks
with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the
instance after the first call.
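The chunked read with a hard size cap can be sketched as follows. This is an illustrative reduction, not the actual body of `_fetch_raw()`; the names `read_limited` and `fetch` are invented for the sketch.

```python
import io
import urllib.request

CHUNK_SIZE = 64 * 1024        # read in 64 KB chunks
MAX_BYTES = 50 * 1024 * 1024  # 50 MB hard limit

def read_limited(stream, limit: int = MAX_BYTES) -> bytes:
    """Accumulate chunks, aborting as soon as the limit is exceeded."""
    chunks, total = [], 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise ValueError(f"response exceeds {limit} byte limit")
        chunks.append(chunk)
    return b"".join(chunks)

def fetch(url: str, timeout: int = 30) -> str:
    req = urllib.request.Request(
        url, headers={"User-Agent": "retrobios-scraper/1.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return read_limited(resp).decode("utf-8")

# Works with any file-like object, e.g. an in-memory stream:
data = read_limited(io.BytesIO(b"x" * 1024))
```

Reading incrementally rather than calling `resp.read()` once means an oversized response is rejected after at most one chunk past the limit, instead of after buffering the whole body.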
### BiosRequirement

A dataclass representing a single BIOS file entry:

```python
@dataclass
class BiosRequirement:
    name: str                       # filename
    system: str                     # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""           # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None  # ROM name inside a ZIP
    native_id: str | None = None    # original system name
```
### ChangeSet

Returned by `compare_with_config()`. Contains:

- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)
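The diff itself reduces to a keyed comparison. The sketch below is illustrative, not the actual `compare_with_config()` implementation: entries are plain hash strings keyed by `(system, name)` instead of full `BiosRequirement` objects.

```python
def diff_requirements(scraped: dict, existing: dict):
    """Toy ChangeSet-style diff; values are hash strings keyed by (system, name)."""
    added = sorted(k for k in scraped if k not in existing)
    removed = sorted(k for k in existing if k not in scraped)
    modified = sorted(k for k in scraped.keys() & existing.keys()
                      if scraped[k] != existing[k])
    return added, removed, modified

added, removed, modified = diff_requirements(
    {("psx", "scph5501.bin"): "aaa", ("psx", "scph7001.bin"): "ccc"},
    {("psx", "scph5501.bin"): "bbb", ("psx", "scph1001.bin"): "ddd"},
)
```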
### scraper_cli

`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:

| Flag | Behavior |
|---|---|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |
When `--output` targets an existing file, the CLI preserves keys not generated by
the scraper (e.g. `data_directories`, manual additions). Only the `systems` section
is replaced. If the scraper defines `generate_platform_yaml()`, that method is used
instead of the generic YAML builder.
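The key-preservation behavior amounts to a selective merge. A minimal sketch on plain dicts (the real CLI operates on parsed YAML documents; the helper name is invented):

```python
def merge_platform_yaml(existing: dict, generated: dict) -> dict:
    """Replace only the 'systems' section; keep manually maintained keys."""
    merged = dict(existing)
    merged["systems"] = generated.get("systems", {})
    return merged

result = merge_platform_yaml(
    {"data_directories": ["bios"], "systems": {"old-system": []}},
    {"systems": {"sony-playstation": []}},
)
```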
### Helper functions

Two additional functions in `base_scraper.py`:

- `fetch_github_latest_version(repo)` - fetches the latest release tag via the GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching an optional prefix.
## Creating a BIOS scraper

### Minimal example

Based on the RomM scraper pattern (JSON source, flat structure):
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations
import json
import sys
try:
from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
from base_scraper import BaseScraper, BiosRequirement
PLATFORM_NAME = "myplatform"
SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"
SLUG_MAP: dict[str, str] = {
"psx": "sony-playstation",
"saturn": "sega-saturn",
}
class Scraper(BaseScraper):
def __init__(self, url: str = SOURCE_URL):
super().__init__(url=url)
def fetch_requirements(self) -> list[BiosRequirement]:
raw = self._fetch_raw()
if not self.validate_format(raw):
raise ValueError("Format validation failed")
data = json.loads(raw)
requirements = []
for entry in data:
system = SLUG_MAP.get(entry["platform"])
if not system:
print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
continue
requirements.append(BiosRequirement(
name=entry["filename"],
system=system,
md5=entry.get("md5"),
sha1=entry.get("sha1"),
size=entry.get("size"),
destination=entry["filename"],
required=entry.get("required", True),
))
return requirements
def validate_format(self, raw_data: str) -> bool:
try:
data = json.loads(raw_data)
except (json.JSONDecodeError, TypeError):
return False
return isinstance(data, list) and len(data) > 0
def main():
from scripts.scraper.base_scraper import scraper_cli
scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")
if __name__ == "__main__":
main()
### Parsing different upstream formats

Each platform stores its BIOS requirements differently. The scraper's job is to
normalize them into `BiosRequirement` entries.

| Format | Example | Parsing approach |
|---|---|---|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |
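As an example of the line-by-line regex approach, here is a sketch for a Bash-style source. The `checkBios "name" "md5"` line format is hypothetical, chosen only to illustrate the pattern; the real `checkBIOS.sh` lines differ.

```python
import re

# Hypothetical line shape: checkBios "<filename>" "<32-hex md5>"
LINE_RE = re.compile(r'checkBios\s+"([^"]+)"\s+"([0-9A-Fa-f]{32})"')

def parse_bash_bios_lines(text: str) -> list[tuple[str, str]]:
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            # normalize the hash to lowercase as it is extracted
            entries.append((m.group(1), m.group(2).lower()))
    return entries

sample = '''
checkBios "scph5501.bin" "490F666E1AFB15B7362B406ED1CEA246"
echo "not a bios line"
'''
entries = parse_bash_bios_lines(sample)
```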
### System ID mapping

Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's
native system identifiers to retrobios system IDs. The canonical format is
manufacturer-console in lowercase with hyphens (e.g. `sony-playstation`,
`sega-mega-drive`, `nintendo-gba`).

When a native slug has no mapping, print a warning to stderr. This surfaces new systems added upstream that still need to be mapped.

System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in `common.py` maps common variations to canonical IDs.
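Alias resolution can be sketched as a normalize-then-lookup step. The entries below are examples only; the real `SYSTEM_ALIASES` table in `common.py` is larger:

```python
# Example alias table (illustrative entries, not the actual contents).
SYSTEM_ALIASES = {
    "psx": "sony-playstation",
    "playstation": "sony-playstation",
    "genesis": "sega-mega-drive",
    "megadrive": "sega-mega-drive",
}

def canonical_system_id(slug: str) -> str:
    """Lowercase, hyphenate, then map known variations to the canonical ID."""
    slug = slug.strip().lower().replace("_", "-").replace(" ", "-")
    return SYSTEM_ALIASES.get(slug, slug)
```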
### Hash normalization

- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not an empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution layer handles prefix matching, but the scraper should store the hash as-is.
Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in _shared.yml (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.
## Creating a target scraper

Target scrapers determine which emulator cores are available on each hardware
target. They live in `scripts/scraper/targets/` and follow the same auto-discovery
pattern (`*_targets_scraper.py`).
### BaseTargetScraper ABC

```text
BaseTargetScraper
    __init__(url: str)
    fetch_targets() -> dict   # abstract: return target data
    write_output(data, path)  # write YAML to disk
```
### Output format

`fetch_targets()` must return a dict with this structure:

```python
{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}
```
The targets dict maps target names to their available cores. Core names must
match the names used in emulator profile cores: fields for the target filtering
pipeline to work correctly.
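A hedged sketch of how the filtering pipeline can consume this structure. The helper names and the profile shape (`profile name -> list of cores`) are assumptions for illustration, not the actual pipeline code:

```python
def cores_for_target(scraped: dict, target: str) -> set[str]:
    """Cores available on one target, per the scraped output format."""
    return set(scraped["targets"][target]["cores"])

def filter_profiles(profile_cores: dict[str, list[str]],
                    available: set[str]) -> dict[str, list[str]]:
    """Keep profiles with at least one core available on the target."""
    return {name: [c for c in cores if c in available]
            for name, cores in profile_cores.items()
            if any(c in available for c in cores)}

scraped = {"targets": {"rpi4": {"architecture": "aarch64",
                                "cores": ["pcsx_rearmed", "mgba"]}}}
kept = filter_profiles({"psx": ["beetle_psx", "pcsx_rearmed"],
                        "gc": ["dolphin"]},
                       cores_for_target(scraped, "rpi4"))
```

This is why core names in `cores:` fields must match the scraped names exactly: a spelling mismatch silently drops the profile from the target.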
### Overrides

`platforms/targets/_overrides.yml` provides post-scrape adjustments:

- `aliases`: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- `add_cores`: cores present on a target but not detected by the scraper
- `remove_cores`: cores detected by the scraper but not actually functional
Overrides are applied by load_target_config() in common.py after loading
the scraped data. The scraper itself does not need to handle overrides.
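The override pass can be sketched as below. The field shapes are assumptions (`aliases` as `{alt: canonical}`, `add_cores`/`remove_cores` as `{target: [core, ...]}`); the real `load_target_config()` may structure them differently:

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Rename aliased targets, then add and remove cores per target."""
    aliases = overrides.get("aliases", {})
    out = {aliases.get(name, name): dict(data) for name, data in targets.items()}
    for name, cores in overrides.get("add_cores", {}).items():
        entry = out.setdefault(name, {"cores": []})
        entry["cores"] = entry["cores"] + [c for c in cores
                                           if c not in entry["cores"]]
    for name, cores in overrides.get("remove_cores", {}).items():
        if name in out:
            out[name]["cores"] = [c for c in out[name]["cores"]
                                  if c not in cores]
    return out

result = apply_overrides(
    {"arm64": {"cores": ["mgba", "broken_core"]}},
    {"aliases": {"arm64": "aarch64"},
     "add_cores": {"aarch64": ["pcsx_rearmed"]},
     "remove_cores": {"aarch64": ["broken_core"]}},
)
```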
### Module contract

```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"


class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...
```

Register the target scraper in `_registry.yml`:

```yaml
myplatform:
  target_scraper: myplatform_targets
  target_source: https://...
```
### Existing target scrapers

| Scraper | Source | Approach |
|---|---|---|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | `Config.in` + `es_systems.yml` | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + `rp_module_flags` | Parse Bash scriptmodules for platform flags |
## Parser modules
Shared parsers in scripts/scraper/ handle formats used by multiple scrapers
or formats complex enough to warrant dedicated parsing logic.
### dat_parser

Parses clrmamepro DAT format as used in RetroArch's `System.dat`:

```text
game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```
Produces DatRom dataclass instances with name, size, crc32, md5, sha1,
and system fields. The libretro_scraper uses this parser.
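A minimal sketch of the extraction step. The real `dat_parser` is more robust (quoted names, optional fields, md5/sha1); this regex only covers the simple `rom ( name ... size ... crc ... )` shape shown above:

```python
import re

# Matches: rom ( name <token> size <digits> crc <8 hex chars> ...
ROM_RE = re.compile(r'rom \( name (\S+) size (\d+) crc ([0-9A-Fa-f]{8})')

def parse_dat_roms(text: str) -> list[tuple[str, int, str]]:
    """Return (name, size, crc32) tuples, with the CRC lowercased."""
    return [(m.group(1), int(m.group(2)), m.group(3).lower())
            for m in ROM_RE.finditer(text)]

sample = '''game (
    name "System"
    rom ( name bios.bin size 12345 crc ABCD1234 )
)'''
roms = parse_dat_roms(sample)
```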
### mame_parser

Parses MAME C source files to extract BIOS root sets. Handles:

- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging

Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources
for each tagged MAME version.
### fbneo_parser

Parses FBNeo C source files:

- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag

Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.
### _hash_merge

Text-based YAML patching that merges fetched hash data into emulator profiles while preserving formatting. Two strategies:

- MAME: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- FBNeo: updates individual ROM entries grouped by `archive` field

The merge preserves fields the hash data does not generate (`system`, `note`, `required`) and leaves entries not present in the hash data untouched. It uses text-level YAML manipulation rather than a load-dump cycle to maintain human-readable formatting.
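The text-level approach can be illustrated with a toy patcher that updates one field of one entry via regex, leaving comments and ordering untouched. This is only a demonstration of the idea; `_hash_merge` itself handles far more structure:

```python
import re

def patch_size(yaml_text: str, name: str, new_size: int) -> str:
    """Rewrite the size: value of the entry named `name`, preserving formatting."""
    pattern = re.compile(
        rf'(- name: {re.escape(name)}\n(?:  \S[^\n]*\n)*?  size: )\d+')
    return pattern.sub(rf'\g<1>{new_size}', yaml_text)

doc = ('- name: bios.bin\n'
       '  size: 100\n'
       '  crc32: abcd1234  # preserved comment\n')
patched = patch_size(doc, "bios.bin", 256)
```

A load-dump cycle through a YAML library would produce the same data but discard the comment and any hand-tuned layout, which is exactly what this strategy avoids.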
## Testing

### Development workflow

1. Start with `--dry-run` to preview results before writing any output:
   ```shell
   python -m scripts.scraper.myplatform_scraper --dry-run
   ```
2. Check JSON output for data quality:
   ```shell
   python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
   ```
3. Compare with existing YAML if updating a scraper:
   ```shell
   python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
   diff platforms/myplatform.yml /tmp/test.yml
   ```
4. Run verification after generating:
   ```shell
   python scripts/verify.py --platform myplatform
   python scripts/verify.py --platform myplatform --verbose
   ```
5. Run the full pipeline before committing:
   ```shell
   python scripts/pipeline.py --offline
   ```
### Round-trip testing

If an exporter exists for the platform, validate the scrape-export-compare cycle:

```shell
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml

# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json

# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
    /tmp/exported.json
```
### Common issues

| Symptom | Cause | Fix |
|---|---|---|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable | `_fetch_raw()` has a 30 s timeout; check the URL |
| Response exceeds 50 MB | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |
### E2E tests

The project's test suite (`tests/test_e2e.py`) covers scraper integration at the
pipeline level. When adding a new scraper, verify that the full pipeline passes:

```shell
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```

Both must pass before the scraper is considered complete.