libretro/wiki/adding-a-scraper.md
Abdessamad Derraz d0dd05ddf6 docs: add wiki pages for all audiences, fix .old.yml leak
2026-03-30 23:58:12 +02:00

# Adding a scraper
How to create or modify a scraper for fetching BIOS requirements from upstream
platform sources.
## Scraper architecture
### Plugin discovery
Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py`
module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in
the scraper directory. Each module must export:
- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`
No registration code is needed. Drop a file, export the two names, and it works.
```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```
### BaseScraper ABC
`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:
```
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list       # abstract: parse upstream data
    validate_format(raw_data) -> bool  # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool          # reachability check
```
`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0`
user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks
with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the
instance after the first call.
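The chunked read with a hard size cap can be sketched as follows. Function names here are illustrative, not the project's private API; the limit check is the point:

```python
import io
import urllib.request

MAX_BYTES = 50 * 1024 * 1024  # 50 MB hard limit
CHUNK = 64 * 1024             # 64 KB per read

def read_limited(stream, max_bytes: int = MAX_BYTES) -> bytes:
    """Read a file-like object in chunks, aborting once max_bytes is exceeded."""
    chunks, total = [], 0
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError("Response exceeds size limit")
        chunks.append(chunk)
    return b"".join(chunks)

def fetch_raw(url: str, timeout: int = 30) -> str:
    """Illustrative fetch: custom user-agent, 30 s timeout, capped chunked read."""
    req = urllib.request.Request(url, headers={"User-Agent": "retrobios-scraper/1.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return read_limited(resp).decode("utf-8")
```

Splitting the capped read out of the HTTP call keeps the size-limit logic testable without a network connection.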
### BiosRequirement
A dataclass representing a single BIOS file entry:
```python
@dataclass
class BiosRequirement:
    name: str                       # filename
    system: str                     # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""           # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None  # ROM name inside a ZIP
    native_id: str | None = None    # original system name
```
### ChangeSet
Returned by `compare_with_config()`. Contains:
- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)
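The diff at the heart of `compare_with_config()` can be sketched as a key-based comparison. This is a deliberate simplification: the real method works on full `BiosRequirement` objects against a parsed platform YAML, not on bare hash dicts:

```python
def diff_requirements(existing: dict, fetched: dict) -> dict:
    """Hypothetical sketch of the ChangeSet diff.

    Both arguments map (system, name) keys to a hash string; the real
    implementation compares whole BiosRequirement objects field by field.
    """
    added = [k for k in fetched if k not in existing]
    removed = [k for k in existing if k not in fetched]
    modified = [k for k in fetched if k in existing and fetched[k] != existing[k]]
    return {"added": added, "removed": removed, "modified": modified,
            "has_changes": bool(added or removed or modified)}
```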
### scraper_cli
`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:
| Flag | Behavior |
|------|----------|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |
When `--output` targets an existing file, the CLI preserves keys not generated by
the scraper (e.g. `data_directories`, manual additions). Only the `systems` section
is replaced. If the scraper defines `generate_platform_yaml()`, that method is used
instead of the generic YAML builder.
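The key-preservation behavior can be sketched at the dict level, a simplification of what the CLI does before dumping YAML (the function name is hypothetical):

```python
def merge_platform_config(existing: dict, generated: dict) -> dict:
    """Replace only the scraper-generated 'systems' section, preserving
    hand-maintained keys such as data_directories."""
    merged = dict(existing)
    merged["systems"] = generated["systems"]
    return merged
```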
### Helper functions
Two additional functions in `base_scraper.py`:
- `fetch_github_latest_version(repo)` - fetches the latest release tag via GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching
an optional prefix.
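A minimal sketch of the release-tag helper, assuming the standard GitHub REST endpoint (`/releases/latest` with a `tag_name` field). The helper name and error handling here are illustrative, not the project's actual signature:

```python
import json
import urllib.request

def latest_release_tag(repo: str) -> str:
    """Query the GitHub REST API for a repo's latest release tag."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_tag(resp.read().decode("utf-8"))

def extract_tag(payload: str) -> str:
    """Pull tag_name out of the API's JSON response body."""
    return json.loads(payload)["tag_name"]
```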
## Creating a BIOS scraper
### Minimal example
Based on the RomM scraper pattern (JSON source, flat structure):
```python
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations
import json
import sys
try:
from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
from base_scraper import BaseScraper, BiosRequirement
PLATFORM_NAME = "myplatform"
SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"
SLUG_MAP: dict[str, str] = {
"psx": "sony-playstation",
"saturn": "sega-saturn",
}
class Scraper(BaseScraper):
def __init__(self, url: str = SOURCE_URL):
super().__init__(url=url)
def fetch_requirements(self) -> list[BiosRequirement]:
raw = self._fetch_raw()
if not self.validate_format(raw):
raise ValueError("Format validation failed")
data = json.loads(raw)
requirements = []
for entry in data:
system = SLUG_MAP.get(entry["platform"])
if not system:
print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
continue
requirements.append(BiosRequirement(
name=entry["filename"],
system=system,
md5=entry.get("md5"),
sha1=entry.get("sha1"),
size=entry.get("size"),
destination=entry["filename"],
required=entry.get("required", True),
))
return requirements
def validate_format(self, raw_data: str) -> bool:
try:
data = json.loads(raw_data)
except (json.JSONDecodeError, TypeError):
return False
return isinstance(data, list) and len(data) > 0
def main():
from scripts.scraper.base_scraper import scraper_cli
scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")
if __name__ == "__main__":
main()
```
### Parsing different upstream formats
Each platform stores its BIOS requirements differently. The scraper's job is to
normalize them into `BiosRequirement` entries.
| Format | Example | Parsing approach |
|--------|---------|-----------------|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |
### System ID mapping
Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's
native system identifiers to retrobios system IDs. The retrobios system ID format
is `manufacturer-console` in lowercase with hyphens (e.g. `sony-playstation`,
`sega-mega-drive`, `nintendo-gba`).
When a native slug has no mapping, print a warning to stderr. This surfaces new
systems added upstream that need to be mapped.
System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in `common.py` maps common variations to canonical IDs.
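Assuming `SYSTEM_ALIASES` is a plain `dict[str, str]` mapping variant slugs to canonical IDs, the normalization step can be sketched as:

```python
def canonical_system_id(slug: str, aliases: dict[str, str]) -> str:
    """Normalize a native slug, then resolve it through an alias table
    like SYSTEM_ALIASES (structure assumed here for illustration)."""
    slug = slug.strip().lower().replace("_", "-").replace(" ", "-")
    return aliases.get(slug, slug)
```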
### Hash normalization
- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields
are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution
layer handles prefix matching, but the scraper should store the hash as-is.
Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in `_shared.yml` (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.
## Creating a target scraper
Target scrapers determine which emulator cores are available on each hardware
target. They live in `scripts/scraper/targets/` and follow the same auto-discovery
pattern (`*_targets_scraper.py`).
### BaseTargetScraper ABC
```
BaseTargetScraper
    __init__(url: str)
    fetch_targets() -> dict   # abstract: return target data
    write_output(data, path)  # write YAML to disk
```
### Output format
`fetch_targets()` must return a dict with this structure:
```python
{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}
```
The `targets` dict maps target names to their available cores. Core names must
match the names used in emulator profile `cores:` fields for the target filtering
pipeline to work correctly.
### Overrides
`platforms/targets/_overrides.yml` provides post-scrape adjustments:
- **aliases**: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- **add_cores**: cores present on a target but not detected by the scraper
- **remove_cores**: cores detected by the scraper but not actually functional
Overrides are applied by `load_target_config()` in `common.py` after loading
the scraped data. The scraper itself does not need to handle overrides.
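A sketch of how such an override pass could look, assuming `_overrides.yml` parses into nested dicts keyed by target name (the real `load_target_config()` may structure this differently):

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Apply aliases, add_cores, and remove_cores to scraped target data."""
    aliases = overrides.get("aliases", {})
    result = {}
    for name, data in targets.items():
        canonical = aliases.get(name, name)          # arm64 -> aarch64
        cores = set(data.get("cores", []))
        cores |= set(overrides.get("add_cores", {}).get(canonical, []))
        cores -= set(overrides.get("remove_cores", {}).get(canonical, []))
        result[canonical] = {**data, "cores": sorted(cores)}
    return result
```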
### Module contract
```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"


class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...
```
Register the target scraper in `_registry.yml`:
```yaml
myplatform:
  target_scraper: myplatform_targets
  target_source: https://...
```
### Existing target scrapers
| Scraper | Source | Approach |
|---------|--------|----------|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |
## Parser modules
Shared parsers in `scripts/scraper/` handle formats used by multiple scrapers
or formats complex enough to warrant dedicated parsing logic.
### dat_parser
Parses clrmamepro DAT format as used in RetroArch's `System.dat`:
```
game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```
Produces `DatRom` dataclass instances with `name`, `size`, `crc32`, `md5`, `sha1`,
and `system` fields. The `libretro_scraper` uses this parser.
### mame_parser
Parses MAME C source files to extract BIOS root sets. Handles:
- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging
Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources
for each tagged MAME version.
### fbneo_parser
Parses FBNeo C source files:
- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag
Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.
### _hash_merge
Text-based YAML patching that merges fetched hash data into emulator profiles
while preserving formatting. Two strategies:
- **MAME**: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- **FBNeo**: updates individual ROM entries grouped by `archive` field
The merge preserves fields the hash data does not generate (system, note, required)
and leaves entries not present in the hash data untouched. Uses text-level YAML
manipulation rather than load-dump to maintain human-readable formatting.
## Testing
### Development workflow
1. **Start with `--dry-run`** to preview the scrape before writing any output:
```bash
python -m scripts.scraper.myplatform_scraper --dry-run
```
2. **Check JSON output** for data quality:
```bash
python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
```
3. **Compare with existing YAML** if updating a scraper:
```bash
python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
diff platforms/myplatform.yml /tmp/test.yml
```
4. **Run verification** after generating:
```bash
python scripts/verify.py --platform myplatform
python scripts/verify.py --platform myplatform --verbose
```
5. **Run the full pipeline** before committing:
```bash
python scripts/pipeline.py --offline
```
### Round-trip testing
If an exporter exists for the platform, validate the scrape-export-compare cycle:
```bash
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml
# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json
# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
     /tmp/exported.json
```
### Common issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | `_fetch_raw()` has 30s timeout; check URL |
| `Response exceeds 50 MB` | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |
### E2E tests
The project's test suite (`tests/test_e2e.py`) covers scraper integration at the
pipeline level. When adding a new scraper, verify that the full pipeline passes:
```bash
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```
Both must pass before the scraper is considered complete.