mirror of
https://github.com/Abdess/retroarch_system.git
synced 2026-04-13 12:22:33 -05:00
# Adding a scraper

How to create or modify a scraper for fetching BIOS requirements from upstream
platform sources.

## Scraper architecture

### Plugin discovery

Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py`
module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in
the scraper directory. Each module must export:

- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`

No registration code is needed. Drop a file, export the two names, and it works.

```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```

### BaseScraper ABC

`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:

```
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                       # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list              # abstract: parse upstream data
    validate_format(raw_data) -> bool         # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool                 # reachability check
```

`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0`
user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks
with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the
instance after the first call.
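The chunked read with a hard cap can be sketched as follows. This is an illustrative sketch, not the actual implementation; the names `CHUNK` and `MAX_BYTES` are assumptions:

```python
import io

CHUNK = 64 * 1024              # read granularity
MAX_BYTES = 50 * 1024 * 1024   # hard cap to prevent memory exhaustion


def read_capped(resp, max_bytes: int = MAX_BYTES) -> bytes:
    """Read a file-like HTTP response in chunks, aborting past max_bytes."""
    buf = bytearray()
    while True:
        chunk = resp.read(CHUNK)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError(f"Response exceeds {max_bytes // (1024 * 1024)} MB")
```

Instance-level caching then amounts to storing the decoded result on first call and returning it on subsequent calls.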

### BiosRequirement

A dataclass representing a single BIOS file entry:

```python
@dataclass
class BiosRequirement:
    name: str                        # filename
    system: str                      # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""            # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None   # ROM name inside a ZIP
    native_id: str | None = None     # original system name
```

### ChangeSet

Returned by `compare_with_config()`. Contains:

- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)
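A minimal sketch of such a container, using the field names from the list above (the exact `summary()` wording is an assumption based on the example):

```python
from dataclasses import dataclass, field


@dataclass
class ChangeSet:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    modified: list = field(default_factory=list)  # (old, new) pairs

    @property
    def has_changes(self) -> bool:
        return bool(self.added or self.removed or self.modified)

    def summary(self) -> str:
        parts = [
            f"{sign}{len(items)} {label}"
            for sign, items, label in (
                ("+", self.added, "added"),
                ("-", self.removed, "removed"),
                ("~", self.modified, "modified"),
            )
            if items
        ]
        return ", ".join(parts) or "no changes"
```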

### scraper_cli

`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:

| Flag | Behavior |
|------|----------|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |

When `--output` targets an existing file, the CLI preserves keys not generated by
the scraper (e.g. `data_directories`, manual additions). Only the `systems` section
is replaced. If the scraper defines `generate_platform_yaml()`, that method is used
instead of the generic YAML builder.
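Conceptually, the key-preservation behavior replaces a single key in the loaded mapping. Assuming the YAML has already been parsed into plain dicts, the merge is roughly:

```python
def merge_platform_config(existing: dict, scraped_systems: dict) -> dict:
    """Replace only the 'systems' section, keeping manual top-level keys intact."""
    merged = dict(existing)          # shallow copy preserves all other keys
    merged["systems"] = scraped_systems
    return merged


existing = {
    "data_directories": ["bios"],    # manual addition, must survive the merge
    "systems": {"old-system": {}},
}
merged = merge_platform_config(existing, {"sony-playstation": {"files": []}})
```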

### Helper functions

Two additional functions in `base_scraper.py`:

- `fetch_github_latest_version(repo)` - fetches the latest release tag via the GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching
  an optional prefix.

## Creating a BIOS scraper

### Minimal example

Based on the RomM scraper pattern (JSON source, flat structure):

```python
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"

SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}


class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")

        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0


def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")


if __name__ == "__main__":
    main()
```

### Parsing different upstream formats

Each platform stores its BIOS requirements differently. The scraper's job is to
normalize them into `BiosRequirement` entries.

| Format | Example | Parsing approach |
|--------|---------|-----------------|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |

### System ID mapping

Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's
native system identifiers to retrobios system IDs. The retrobios system ID format
is `manufacturer-console` in lowercase with hyphens (e.g. `sony-playstation`,
`sega-mega-drive`, `nintendo-gba`).
When a native slug has no mapping, print a warning to stderr. This surfaces new
systems added upstream that need to be mapped.

System ID consistency matters for cross-platform operations: the same console must
use the same ID across all scrapers and platforms. Before inventing a new ID, check
existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in
`common.py` maps common variations to canonical IDs.
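Resolution through such an alias table might look like this. The table contents below are illustrative, not the actual entries in `common.py`:

```python
# Hypothetical excerpt; the real table lives in common.py.
SYSTEM_ALIASES = {
    "psx": "sony-playstation",
    "ps1": "sony-playstation",
    "megadrive": "sega-mega-drive",
    "genesis": "sega-mega-drive",
}


def canonical_system_id(raw: str) -> str:
    """Normalize a raw identifier to lowercase-hyphen form, then apply aliases."""
    slug = raw.strip().lower().replace("_", "-").replace(" ", "-")
    return SYSTEM_ALIASES.get(slug, slug)
```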

### Hash normalization

- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields
  are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution
  layer handles prefix matching, but the scraper should store the hash as-is.

Scraped data reflects what the upstream declares, which may not match reality. The
scraper's job is faithful transcription of upstream data, not correction.
Corrections happen in the emulator profiles (source-verified) and in `_shared.yml`
(curated). If a scraper detects an obviously wrong hash or filename, log a warning
but still include the upstream value. The divergence will surface during truth
diffing.

## Creating a target scraper

Target scrapers determine which emulator cores are available on each hardware
target. They live in `scripts/scraper/targets/` and follow the same auto-discovery
pattern (`*_targets_scraper.py`).

### BaseTargetScraper ABC

```
BaseTargetScraper
    __init__(url: str)
    fetch_targets() -> dict    # abstract: return target data
    write_output(data, path)   # write YAML to disk
```

### Output format

`fetch_targets()` must return a dict with this structure:

```python
{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}
```

The `targets` dict maps target names to their available cores. Core names must
match the names used in emulator profile `cores:` fields for the target filtering
pipeline to work correctly.
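A quick consistency check between a scraped targets dict and the profile core names can be sketched as follows (`profile_cores` is a stand-in for data loaded from the emulator profiles):

```python
def unknown_cores(data: dict, profile_cores: set) -> dict:
    """Map each target to cores that no emulator profile declares."""
    return {
        name: sorted(set(info["cores"]) - profile_cores)
        for name, info in data["targets"].items()
        if set(info["cores"]) - profile_cores
    }


scraped = {
    "targets": {
        "x86_64": {"architecture": "x86_64", "cores": ["beetle_psx", "dolphin"]},
        "rpi4": {"architecture": "aarch64", "cores": ["pcsx_rearmed", "typo_core"]},
    }
}
bad = unknown_cores(scraped, {"beetle_psx", "dolphin", "pcsx_rearmed"})
```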

### Overrides

`platforms/targets/_overrides.yml` provides post-scrape adjustments:

- **aliases**: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- **add_cores**: cores present on a target but not detected by the scraper
- **remove_cores**: cores detected by the scraper but not actually functional

Overrides are applied by `load_target_config()` in `common.py` after loading
the scraped data. The scraper itself does not need to handle overrides.
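Conceptually, applying the three override kinds looks something like this sketch (assuming the overrides file has already been parsed into a dict; not the actual `load_target_config()` code):

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Rename targets via aliases, then add/remove cores per canonical name."""
    aliases = overrides.get("aliases", {})
    result = {}
    for name, info in targets.items():
        canonical = aliases.get(name, name)
        cores = set(info["cores"])
        cores |= set(overrides.get("add_cores", {}).get(canonical, []))
        cores -= set(overrides.get("remove_cores", {}).get(canonical, []))
        result[canonical] = {**info, "cores": sorted(cores)}
    return result
```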

### Module contract

```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"


class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...
```

Register the target scraper in `_registry.yml`:

```yaml
myplatform:
  target_scraper: myplatform_targets
  target_source: https://...
```

### Existing target scrapers

| Scraper | Source | Approach |
|---------|--------|----------|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |

## Parser modules

Shared parsers in `scripts/scraper/` handle formats used by multiple scrapers
or formats complex enough to warrant dedicated parsing logic.

### dat_parser

Parses the clrmamepro DAT format as used in RetroArch's `System.dat`:

```
game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```

Produces `DatRom` dataclass instances with `name`, `size`, `crc32`, `md5`, `sha1`,
and `system` fields. The `libretro_scraper` uses this parser.

### mame_parser

Parses MAME C source files to extract BIOS root sets. Handles:

- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging

Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources
for each tagged MAME version.

### fbneo_parser

Parses FBNeo C source files:

- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with the `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag

Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.

### _hash_merge

Text-based YAML patching that merges fetched hash data into emulator profiles
while preserving formatting. Two strategies:

- **MAME**: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- **FBNeo**: updates individual ROM entries grouped by `archive` field

The merge preserves fields the hash data does not generate (system, note, required)
and leaves entries not present in the hash data untouched. It uses text-level YAML
manipulation rather than load-dump to maintain human-readable formatting.

## Testing

### Development workflow

1. **Start with `--dry-run`** to preview fetched data before writing any output:

   ```bash
   python -m scripts.scraper.myplatform_scraper --dry-run
   ```

2. **Check JSON output** for data quality:

   ```bash
   python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
   ```

3. **Compare with existing YAML** if updating a scraper:

   ```bash
   python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
   diff platforms/myplatform.yml /tmp/test.yml
   ```

4. **Run verification** after generating:

   ```bash
   python scripts/verify.py --platform myplatform
   python scripts/verify.py --platform myplatform --verbose
   ```

5. **Run the full pipeline** before committing:

   ```bash
   python scripts/pipeline.py --offline
   ```

### Round-trip testing

If an exporter exists for the platform, validate the scrape-export-compare cycle:

```bash
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml

# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json

# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
    /tmp/exported.json
```

### Common issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable | `_fetch_raw()` has a 30 s timeout; check the URL |
| `Response exceeds 50 MB` | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |

### E2E tests

The project's test suite (`tests/test_e2e.py`) covers scraper integration at the
pipeline level. When adding a new scraper, verify that the full pipeline passes:

```bash
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```

Both must pass before the scraper is considered complete.