docs: add wiki pages for all audiences, fix .old.yml leak

9 new wiki pages: getting-started, faq, troubleshooting,
advanced-usage, verification-modes, adding-a-platform,
adding-a-scraper, testing-guide, release-process.

Updated architecture.md with mermaid diagrams, tools.md with
full pipeline and target/exporter sections, profiling.md with
missing fields, index.md with glossary and nav links.

Expanded CONTRIBUTING.md from stub to full contributor guide.

Filter .old.yml from load_emulator_profiles, generate_db alias
collection, and generate_readme counts. Fix BizHawk sha1 mode
in tools.md, fix RetroPie path, fix export_truth.py typos.
This commit is contained in:
Abdessamad Derraz
2026-03-30 22:51:29 +02:00
parent 038c3d3b40
commit d4b0af0a38
20 changed files with 2742 additions and 63 deletions


@@ -1,14 +1,109 @@
# Contributing to RetroBIOS
## Types of contributions
- **Add a BIOS file** - a great way to get started. Fork, add the file, open a PR.
- **Create an emulator profile** - document what a core actually loads from source code. See the [profiling guide](https://abdess.github.io/retrobios/wiki/profiling/).
- **Add a platform** - integrate a new frontend (scraper + YAML config). See [adding a platform](https://abdess.github.io/retrobios/wiki/adding-a-platform/).
- **Add or fix a scraper** - parse upstream sources for BIOS requirements. See [adding a scraper](https://abdess.github.io/retrobios/wiki/adding-a-scraper/).
- **Fix a bug or improve tooling** - Python scripts in `scripts/`, single dependency (`pyyaml`).
## Local setup
```bash
git clone https://github.com/Abdess/retrobios.git
cd retrobios
pip install pyyaml
# run tests
python -m unittest tests.test_e2e -v
# run full pipeline (DB + verify + packs + consistency check)
python scripts/pipeline.py --offline
```
Requires Python 3.10 or later.
## Adding a BIOS file
1. Place the file in `bios/Manufacturer/Console/filename`.
2. Alternate versions (different hash, same purpose) go in `bios/Manufacturer/Console/.variants/`.
3. Files over 50 MB go as assets on the `large-files` GitHub release (git handles them better that way).
4. RPG Maker and ScummVM directories are excluded from deduplication - please keep their structure as-is.
5. Open a pull request. CI validates checksums automatically and posts a report.
## Commit conventions
Format: `type: description` (50 characters max, lowercase start).
Allowed types: `feat`, `refactor`, `chore`, `docs`, `fix`.
```
feat: add panasonic 3do bios files
docs: update architecture diagram
fix: resolve truncated md5 matching
chore: remove unused test fixtures
refactor: extract hash logic to common.py
```
Keep messages factual. No marketing language, no superfluous adjectives.
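These rules are easy to machine-check. A hypothetical validator (not part of the repository's tooling) might look like:

```python
import re

# Hypothetical helper, not part of the repo: checks a commit subject
# against the conventions above (type prefix, <=50 chars, lowercase start).
ALLOWED_TYPES = {"feat", "refactor", "chore", "docs", "fix"}

def is_valid_subject(subject: str) -> bool:
    m = re.fullmatch(r"(\w+): (.+)", subject)
    if m is None:
        return False
    kind, desc = m.groups()
    return kind in ALLOWED_TYPES and len(subject) <= 50 and not desc[0].isupper()
```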
## Code and documentation quality
The codebase runs on Python 3.10+ with a single dependency (`pyyaml`). All modules
include `from __future__ import annotations` at the top. Type hints on every function
signature, `pathlib` instead of `os.path`, and dataclasses where a plain class would
just hold attributes.
On performance: O(1) or O(n) algorithms are preferred. If something needs O(n^2), a
comment explaining why helps future readers. List comprehensions over explicit loops,
generators when iterating large datasets, and standard default arguments
(`def f(items=None)` over `def f(items=[])`).
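The default-argument rule guards against a classic Python pitfall: a mutable default is created once, at function definition, and shared across every call. A quick demonstration:

```python
def broken(items=[]):   # default list is created once and shared
    items.append(1)
    return items

def safe(items=None):   # fresh list on every call
    if items is None:
        items = []
    items.append(1)
    return items

print(broken())  # [1]
print(broken())  # [1, 1] -- state leaked from the previous call
print(safe())    # [1]
print(safe())    # [1]
```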
File I/O uses context managers. ZIP extraction goes through `safe_extract_zip()` in
`common.py`, which prevents zip-slip path traversal.
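The zip-slip check can be sketched as follows; this is an illustrative reimplementation, not the actual `safe_extract_zip()` from `common.py`:

```python
import zipfile
from pathlib import Path

def safe_extract(zip_path: Path, dest: Path) -> None:
    """Refuse archive members that would resolve outside `dest`.
    Illustrative sketch; the project's real logic lives in common.py."""
    dest = dest.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # A member like "../evil.txt" resolves outside dest -> reject
            target = (dest / member).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"blocked path traversal: {member}")
        zf.extractall(dest)
```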
The code stays lean. Comments that describe *why* age better than comments that
restate *what*. Unused variables can be deleted rather than renamed with an underscore.
The same spirit applies to documentation and emulator profiles. Straightforward
language, honest labels ("untested" when something is untested).
When a bug is spotted while working on something else, fixing it in the same PR
keeps things tidy. Features ideally ship complete in one pass: code, tests, a
passing pipeline run, and any documentation updates together.
## Pull request process
CI runs four checks on every PR:
| Check | What it does |
|-------|-------------|
| `validate-bios` | hashes changed BIOS files against the database, posts a validation report |
| `validate-configs` | schema-validates platform YAML configs |
| `run-tests` | runs the full E2E test suite |
| `label-pr` | auto-labels the PR based on changed file paths |
All four checks must pass. One maintainer approval is required before merge.
## Project philosophy
This repository is an archive. The goal is a complete collection of every file a
user needs to run every supported console in an emulator, with zero friction. That
scope goes beyond BIOS ROMs: firmware updates, system fonts, data files, encryption
keys, hiscore databases, anything the emulator loads from disk rather than generating
itself. In a hundred years the pack should still work out of the box.
Licensing considerations are secondary to completeness. If an emulator needs a
file to function, the file belongs in the collection. The project's legal position is
documented in the FAQ section of the site.
Integrity matters. Every file, hash, and metadata field is cross-checked
against the emulator's source code. Upstream references like System.dat,
`.info` files, and wiki pages are valuable and generally accurate, though
they can occasionally fall out of date. When an upstream source and the
code disagree, the code at runtime is the tiebreaker.
## Documentation
Full reference docs, profiling guides, and architecture details are on the [documentation site](https://abdess.github.io/retrobios/).


@@ -2,7 +2,7 @@
Complete BIOS and firmware packs for Batocera, BizHawk, EmuDeck, Lakka, Recalbox, RetroArch, RetroBat, RetroDECK, RetroPie, and RomM.
**7,239** verified files across **396** systems, ready to extract into your emulator's BIOS directory.
**7,241** verified files across **396** systems, ready to extract into your emulator's BIOS directory.
## Quick Install
@@ -46,7 +46,7 @@ Each file is checked against the emulator's source code to match what the code a
- **10 platforms** supported with platform-specific verification
- **329 emulators** profiled from source (RetroArch cores + standalone)
- **396 systems** covered (NES, SNES, PlayStation, Saturn, Dreamcast, ...)
- **7,239 files** verified with MD5, SHA1, CRC32 checksums
- **7,241 files** verified with MD5, SHA1, CRC32 checksums
- **8144 MB** total collection size
## Supported systems
@@ -65,7 +65,7 @@ Full list with per-file details: **[https://abdess.github.io/retrobios/](https:/
| Lakka | 442/448 (98.7%) | 442 | 0 | 6 |
| Recalbox | 277/346 (80.1%) | 274 | 3 | 69 |
| RetroArch | 442/448 (98.7%) | 442 | 0 | 6 |
| RetroBat | 337/339 (99.4%) | 333 | 4 | 2 |
| RetroBat | 339/339 (100.0%) | 335 | 4 | 0 |
| RetroDECK | 1960/2006 (97.7%) | 1934 | 26 | 46 |
| RetroPie | 442/448 (98.7%) | 442 | 0 | 6 |
| RomM | 372/374 (99.5%) | 372 | 0 | 2 |
@@ -130,4 +130,4 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
This repository provides BIOS files for personal backup and archival purposes.
*Auto-generated on 2026-03-30T12:09:51Z*
*Auto-generated on 2026-03-30T20:16:27Z*


@@ -132,7 +132,7 @@ nav:
- ZC: systems/zc.md
- Emulators:
- Overview: emulators/index.md
- Official ports (62):
- Official ports (63):
- amiarcadia: emulators/amiarcadia.md
- Amiberry: emulators/amiberry.md
- Ardens: emulators/ardens.md
@@ -180,6 +180,7 @@ nav:
- PicoDrive: emulators/picodrive.md
- play: emulators/play.md
- PPSSPP: emulators/ppsspp.md
- RPCS3: emulators/rpcs3.md
- Rustation: emulators/rustation.md
- RVVM: emulators/rvvm.md
- SameBoy: emulators/sameboy.md
@@ -429,7 +430,7 @@ nav:
- PCSX-ReARMed: emulators/pcsx_rearmed.md
- Launchers (1):
- Dolphin Launcher: emulators/dolphin_launcher.md
- Other (24):
- Other (23):
- ares: emulators/ares.md
- Beetle GBA (Mednafen): emulators/beetle_gba.md
- BigPEmu: emulators/bigpemu.md
@@ -442,7 +443,6 @@ nav:
- Model 2 Emulator: emulators/model2.md
- openMSX: emulators/openmsx.md
- Redream: emulators/redream.md
- RPCS3: emulators/rpcs3.md
- Ryujinx: emulators/ryujinx.md
- shadps4: emulators/shadps4.md
- Supermodel: emulators/supermodel.md
@@ -458,8 +458,17 @@ nav:
- Gap Analysis: gaps.md
- Wiki:
- Overview: wiki/index.md
- Getting started: wiki/getting-started.md
- FAQ: wiki/faq.md
- Troubleshooting: wiki/troubleshooting.md
- Architecture: wiki/architecture.md
- Tools: wiki/tools.md
- Profiling guide: wiki/profiling.md
- Advanced usage: wiki/advanced-usage.md
- Verification modes: wiki/verification-modes.md
- Data model: wiki/data-model.md
- Profiling guide: wiki/profiling.md
- Adding a platform: wiki/adding-a-platform.md
- Adding a scraper: wiki/adding-a-scraper.md
- Testing guide: wiki/testing-guide.md
- Release process: wiki/release-process.md
- Contributing: contributing.md


@@ -683,6 +683,8 @@ def load_emulator_profiles(
    if not emu_path.exists():
        return profiles
    for f in sorted(emu_path.glob("*.yml")):
        if f.name.endswith(".old.yml"):
            continue
        with open(f) as fh:
            profile = yaml.safe_load(fh) or {}
        if "emulator" not in profile:


@@ -401,6 +401,8 @@ def _collect_all_aliases(files: dict) -> dict:
    try:
        import yaml
        for emu_file in emulators_dir.glob("*.yml"):
            if emu_file.name.endswith(".old.yml"):
                continue
            try:
                with open(emu_file) as f:
                    emu_config = yaml.safe_load(f) or {}


@@ -91,6 +91,7 @@ def generate_readme(db: dict, platforms_dir: str) -> str:
    emulator_count = sum(
        1 for f in Path("emulators").glob("*.yml")
        if not f.name.endswith(".old.yml")
    ) if Path("emulators").exists() else 0
    # Count systems from emulator profiles
@@ -100,6 +101,8 @@ def generate_readme(db: dict, platforms_dir: str) -> str:
    try:
        import yaml
        for f in emu_dir.glob("*.yml"):
            if f.name.endswith(".old.yml"):
                continue
            with open(f) as fh:
                p = yaml.safe_load(fh) or {}
            system_ids.update(p.get("systems", []))


@@ -1358,6 +1358,8 @@ The CI automatically:
# Wiki pages
# index, architecture, tools, profiling are maintained as wiki/ sources
# and copied verbatim by main(). Only data-model is generated dynamically.
def generate_wiki_index() -> str:
    """Generate wiki landing page."""
@@ -1994,10 +1996,19 @@ def generate_mkdocs_nav(
    wiki_nav = [
        {"Overview": "wiki/index.md"},
        {"Getting started": "wiki/getting-started.md"},
        {"FAQ": "wiki/faq.md"},
        {"Troubleshooting": "wiki/troubleshooting.md"},
        {"Architecture": "wiki/architecture.md"},
        {"Tools": "wiki/tools.md"},
        {"Profiling guide": "wiki/profiling.md"},
        {"Advanced usage": "wiki/advanced-usage.md"},
        {"Verification modes": "wiki/verification-modes.md"},
        {"Data model": "wiki/data-model.md"},
        {"Profiling guide": "wiki/profiling.md"},
        {"Adding a platform": "wiki/adding-a-platform.md"},
        {"Adding a scraper": "wiki/adding-a-scraper.md"},
        {"Testing guide": "wiki/testing-guide.md"},
        {"Release process": "wiki/release-process.md"},
    ]
    return [
@@ -2180,7 +2191,7 @@ plugins:
        + 1  # cross-reference
        + 1 + len(profiles)  # emulator index + detail
        + 1  # gap analysis
        + 5  # wiki (index, architecture, tools, profiling, data model)
        + 14  # wiki pages (copied from wiki/ + generated data-model)
        + 1  # contributing
    )
    print(f"\nGenerated {total_pages} pages in {args.docs_dir}/")

wiki/adding-a-platform.md Normal file

@@ -0,0 +1,352 @@
# Adding a platform
How to add support for a new retrogaming platform (e.g. a frontend like Batocera,
a manager like EmuDeck, or a firmware database like BizHawk).
## Prerequisites
Before starting, gather the following from the upstream project:
- **Where does it define BIOS requirements?** Each platform has a canonical source:
a DAT file, a JSON fixture, an XML manifest, a Bash script, a C# database, etc.
- **What verification mode does it use?** Read the platform source code to determine
how it checks BIOS files at runtime: file existence only (`existence`), MD5 hash
matching (`md5`), SHA1 matching (`sha1`), or a combination of size and hash.
- **What is the base destination?** The directory name where BIOS files are placed
on disk (e.g. `system` for RetroArch, `bios` for Batocera, `Firmware` for BizHawk).
- **What hash type does it store?** The primary hash format used in the platform's
own data files (SHA1 for RetroArch/BizHawk, MD5 for Batocera/Recalbox/EmuDeck).
## Step 1: Create the scraper
Scrapers live in `scripts/scraper/` and are auto-discovered by the plugin system.
Any file matching `*_scraper.py` in that directory is loaded at import time via
`pkgutil.iter_modules`. No registration step is needed beyond placing the file.
### Module contract
The module must export two names:
```python
PLATFORM_NAME = "myplatform"  # matches the key in _registry.yml

class Scraper(BaseScraper):
    ...
```
### Inheriting BaseScraper
`BaseScraper` provides:
- `_fetch_raw() -> str` - HTTP GET with 50 MB response limit, cached after first call.
Uses `urllib.request` with a `retrobios-scraper/1.0` user-agent and 30s timeout.
- `compare_with_config(config) -> ChangeSet` - diffs scraped requirements against
an existing platform YAML, returning added/removed/modified entries.
- `test_connection() -> bool` - checks if the source URL is reachable.
Two abstract methods must be implemented:
```python
def fetch_requirements(self) -> list[BiosRequirement]:
    """Parse the upstream source and return one BiosRequirement per file."""

def validate_format(self, raw_data: str) -> bool:
    """Return False if the upstream format has changed unexpectedly."""
```
### BiosRequirement fields
| Field | Type | Description |
|-------|------|-------------|
| `name` | `str` | Filename as the platform expects it |
| `system` | `str` | Retrobios system ID (e.g. `sony-playstation`) |
| `sha1` | `str \| None` | SHA1 hash if available |
| `md5` | `str \| None` | MD5 hash if available |
| `crc32` | `str \| None` | CRC32 if available |
| `size` | `int \| None` | Expected file size in bytes |
| `destination` | `str` | Relative path within the BIOS directory |
| `required` | `bool` | Whether the platform considers this file mandatory |
| `zipped_file` | `str \| None` | If set, the hash refers to a ROM inside a ZIP |
| `native_id` | `str \| None` | Original system name before normalization |
### System ID mapping
Every scraper needs a mapping from the platform's native system identifiers to
retrobios system IDs. Define this as a module-level dict:
```python
SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
    ...
}
```
Warn on unmapped slugs so new systems are surfaced during scraping.
### generate_platform_yaml (optional)
If the scraper defines a `generate_platform_yaml() -> dict` method, the shared
CLI will use it instead of the generic YAML builder. This allows the scraper to
include platform metadata (homepage, version, inherits, cores list) in the output.
### CLI entry point
Add a `main()` function and `__main__` guard:
```python
def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")

if __name__ == "__main__":
    main()
```
`scraper_cli` provides `--dry-run`, `--json`, and `--output` flags automatically.
### Test the scraper
```bash
python -m scripts.scraper.myplatform_scraper --dry-run
```
This fetches from upstream and prints a summary without writing anything.
## Step 2: Register the platform
Add an entry to `platforms/_registry.yml` under the `platforms:` key.
### Required fields
```yaml
platforms:
  myplatform:
    config: myplatform.yml      # platform YAML filename in platforms/
    status: active              # active or archived
    scraper: myplatform         # matches PLATFORM_NAME in the scraper
    source_url: https://...     # upstream data URL
    source_format: json         # json, xml, clrmamepro_dat, python_dict, bash_script+csv, csharp_firmware_database, github_component_manifests
    hash_type: md5              # primary hash in the upstream data
    verification_mode: md5      # how the platform checks files: existence, md5, sha1
    base_destination: bios      # where files go on disk
    cores:                      # which emulator profiles apply
      - core_a
      - core_b
```
The `cores` field determines which emulator profiles are resolved for this platform.
Three strategies exist:
- **Explicit list**: `cores: [beetle_psx, dolphin, ...]` - match by profile key name.
Used by Batocera, Recalbox, RetroBat, RomM.
- **all_libretro**: `cores: all_libretro` - include every profile with `type: libretro`
or `type: standalone + libretro`. Used by RetroArch, Lakka, RetroPie.
- **Omitted**: fallback to system ID intersection. Used by EmuDeck.
### Optional fields
```yaml
logo: https://...                   # SVG or PNG for UI/docs
schedule: weekly                    # scrape frequency: weekly, monthly, or null
inherits_from: retroarch            # inherit systems/cores from another platform
case_insensitive_fs: true           # if the platform runs on case-insensitive filesystems
target_scraper: myplatform_targets  # hardware target scraper name
target_source: https://...          # target data source URL
install:
  detect:                           # auto-detection for install.py
    - os: linux
      method: config_file
      config: $HOME/.config/myplatform/config.ini
      parse_key: bios_directory
```
### Inheritance
If the new platform inherits from an existing one (e.g. Lakka inherits RetroArch),
set `inherits_from` in the registry AND add `inherits: retroarch` in the platform
YAML itself. `load_platform_config()` reads the `inherits:` field from the YAML to
merge parent systems and shared groups into the child. The child YAML only needs to
declare overrides.
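The merge described above can be sketched roughly like this (field names and precedence here are assumptions for illustration; the authoritative logic is `load_platform_config()` in `common.py`):

```python
# Illustrative sketch of parent/child platform merging; the real logic
# lives in load_platform_config() and may differ in detail.
def merge_platform(parent: dict, child: dict) -> dict:
    merged = dict(parent)
    for key, value in child.items():
        if key == "systems" and isinstance(value, dict):
            # Overlay per-system entries instead of replacing wholesale
            systems = dict(parent.get("systems", {}))
            systems.update(value)
            merged["systems"] = systems
        else:
            merged[key] = value
    merged.pop("inherits", None)  # resolved, no longer needed
    return merged

retroarch = {"systems": {"sony-playstation": {"files": 3}}, "base_destination": "system"}
lakka = {"inherits": "retroarch", "systems": {"sega-saturn": {"files": 2}}}
print(merge_platform(retroarch, lakka))
```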
## Step 3: Generate the platform YAML
Run the scraper with `--output` to produce the initial platform configuration:
```bash
python -m scripts.scraper.myplatform_scraper --output platforms/myplatform.yml
```
If a file already exists at the output path, the CLI preserves fields that the
scraper does not generate (e.g. `data_directories`, manually added metadata).
Only the `systems` section is replaced.
Verify the result:
```bash
python scripts/verify.py --platform myplatform
python scripts/verify.py --platform myplatform --verbose
```
## Step 4: Add verification logic
Check how the platform verifies BIOS files by reading its source code.
The `verification_mode` in the registry tells `verify.py` which strategy to use:
| Mode | Behavior | Example platforms |
|------|----------|-------------------|
| `existence` | File must exist, no hash check | RetroArch, Lakka, RetroPie |
| `md5` | MD5 must match the declared hash | Batocera, Recalbox, RetroBat, EmuDeck, RetroDECK |
| `sha1` | SHA1 must match | BizHawk |
If the platform has unique verification behavior (e.g. Batocera's `checkInsideZip`,
Recalbox's multi-hash comma-separated MD5, RomM's size + any-hash), add the logic
to `verify.py` in the platform-specific verification path.
Read the platform's source code to understand its exact verification behavior before writing any logic. Batocera's `checkInsideZip` uses `casefold()` for case-insensitive matching. Recalbox supports comma-separated MD5 lists. RomM checks file size before hashing. These details matter: the project replicates native behavior, not an approximation of it.
## Step 5: Create an exporter (optional)
Exporters convert truth data back to the platform's native format. They live in
`scripts/exporter/` and follow the same auto-discovery pattern (`*_exporter.py`).
### Module contract
The module must export an `Exporter` class inheriting `BaseExporter`:
```python
from scripts.exporter.base_exporter import BaseExporter

class Exporter(BaseExporter):
    @staticmethod
    def platform_name() -> str:
        return "myplatform"

    def export(self, truth_data: dict, output_path: str, scraped_data: dict | None = None) -> None:
        # Write truth_data in the platform's native format to output_path
        ...

    def validate(self, truth_data: dict, output_path: str) -> list[str]:
        # Return a list of issues (empty = valid)
        ...
```
`BaseExporter` provides helper methods:
- `_is_pattern(name)` - True if the filename contains wildcards or placeholders.
- `_dest(fe)` - resolve destination path from a file entry dict.
- `_display_name(sys_id, scraped_sys)` - convert a system slug to a display name.
### Round-trip validation
The exporter enables a scrape-export-compare workflow:
```bash
# Scrape upstream
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml
# Export truth data
python scripts/export_native.py --platform myplatform --output /tmp/exported.json
# Compare exported file with upstream
diff /tmp/scraped.yml /tmp/exported.json
```
## Step 6: Create a target scraper (optional)
Target scrapers determine which emulator cores are available on each hardware
target (e.g. which RetroArch cores exist for Switch, RPi4, or x86_64).
They live in `scripts/scraper/targets/` and are auto-discovered by filename
(`*_targets_scraper.py`).
### Module contract
```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"

class Scraper(BaseTargetScraper):
    def fetch_targets(self) -> dict:
        return {
            "platform": "myplatform",
            "source": "https://...",
            "scraped_at": "2026-03-30T00:00:00Z",
            "targets": {
                "x86_64": {
                    "architecture": "x86_64",
                    "cores": ["beetle_psx", "dolphin", "..."],
                },
                "rpi4": {
                    "architecture": "aarch64",
                    "cores": ["pcsx_rearmed", "mgba", "..."],
                },
            },
        }
```
Add `target_scraper` and `target_source` to the platform's registry entry.
### Overrides
Hardware-specific overrides go in `platforms/targets/_overrides.yml`. This file
defines aliases (e.g. `arm64` maps to `aarch64`) and per-platform core
additions/removals that the scraper cannot determine automatically.
### Single-target platforms
For platforms that only run on one target (e.g. RetroBat on Windows, RomM in the
browser), create a static YAML file in `platforms/targets/` instead of a scraper.
Set `target_scraper: null` in the registry.
## Step 7: Add install detection (optional)
The `install` section in `_registry.yml` tells `install.py` how to detect
the platform on the user's machine and locate its BIOS directory.
Three detection methods are available:
| Method | Description | Fields |
|--------|-------------|--------|
| `config_file` | Parse a key from a config file | `config`, `parse_key`, optionally `bios_subdir` |
| `path_exists` | Check if a directory exists | `path`, optionally `bios_path` |
| `file_exists` | Check if a file exists | `file`, optionally `bios_path` |
Each entry is scoped to an OS (`linux`, `darwin`, `windows`). Multiple entries
per OS are tried in order.
## Step 8: Validate the full pipeline
After all pieces are in place, run the full pipeline:
```bash
python scripts/pipeline.py --offline
```
This executes in sequence:
1. `generate_db.py` - rebuild `database.json` from `bios/`
2. `refresh_data_dirs.py` - update data directories
3. `verify.py --all` - verify all platforms including the new one
4. `generate_pack.py --all` - build ZIP packs
5. Consistency check - verify counts match between verify and pack
Check the output for:
- The new platform appears in verify results
- No unexpected CRITICAL or WARNING entries
- Pack generation succeeds and includes the expected files
- Consistency check passes (verify file counts match pack file counts)
Verification is not optional. A platform that passes `pipeline.py` today may break tomorrow if upstream changes its data format. Run the full pipeline on every change, even if the modification seems trivial. The consistency check (verify counts must match pack counts) catches subtle issues where files resolve during verification but fail during pack generation, or vice versa.
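The consistency check boils down to a per-platform count comparison, roughly:

```python
# Sketch of the verify-vs-pack consistency idea: file counts from the
# two stages must agree per platform. Data shapes here are illustrative,
# not pipeline.py's actual internals.
def check_consistency(verify_counts: dict[str, int], pack_counts: dict[str, int]) -> list[str]:
    issues = []
    for platform in sorted(set(verify_counts) | set(pack_counts)):
        v = verify_counts.get(platform)
        p = pack_counts.get(platform)
        if v != p:
            issues.append(f"{platform}: verified {v} files but packed {p}")
    return issues

print(check_consistency({"retroarch": 442}, {"retroarch": 442}))  # []
```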
## Checklist
- [ ] Scraper file in `scripts/scraper/<name>_scraper.py`
- [ ] `PLATFORM_NAME` and `Scraper` class exported
- [ ] `fetch_requirements()` and `validate_format()` implemented
- [ ] System ID mapping covers all upstream systems
- [ ] Entry added to `platforms/_registry.yml`
- [ ] Platform YAML generated and verified
- [ ] `python scripts/pipeline.py --offline` passes
- [ ] Exporter in `scripts/exporter/<name>_exporter.py` (if applicable)
- [ ] Target scraper in `scripts/scraper/targets/<name>_targets_scraper.py` (if applicable)
- [ ] Install detection entries in `_registry.yml` (if applicable)

wiki/adding-a-scraper.md Normal file

@@ -0,0 +1,423 @@
# Adding a scraper
How to create or modify a scraper for fetching BIOS requirements from upstream
platform sources.
## Scraper architecture
### Plugin discovery
Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py`
module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in
the scraper directory. Each module must export:
- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`
No registration code is needed. Drop a file, export the two names, and it works.
```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```
### BaseScraper ABC
`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:
```
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                       # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list              # abstract: parse upstream data
    validate_format(raw_data) -> bool         # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool                 # reachability check
```
`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0`
user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks
with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the
instance after the first call.
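The capped read can be sketched independently of HTTP. The function below takes any binary file-like object, so in `_fetch_raw()` it would be fed the `urlopen` response (with the user-agent and timeout described above); this is an illustration, not the actual code:

```python
import io

# Sketch of a size-capped chunked read, as _fetch_raw() is described
# as doing. Written against any binary file-like so it is easy to test.
def read_capped(resp, max_bytes: int = 50 * 1024 * 1024, chunk_size: int = 64 * 1024) -> bytes:
    buf = io.BytesIO()
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            return buf.getvalue()
        if buf.tell() + len(chunk) > max_bytes:
            raise ValueError(f"response exceeds {max_bytes} byte limit")
        buf.write(chunk)
```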
### BiosRequirement
A dataclass representing a single BIOS file entry:
```python
@dataclass
class BiosRequirement:
    name: str                       # filename
    system: str                     # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""           # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None  # ROM name inside a ZIP
    native_id: str | None = None    # original system name
```
### ChangeSet
Returned by `compare_with_config()`. Contains:
- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)
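A minimal re-sketch of that shape (the real class lives in `scripts/scraper/base_scraper.py` and may differ in detail, e.g. in how an empty set is summarized):

```python
from dataclasses import dataclass, field

@dataclass
class ChangeSetSketch:
    """Minimal re-sketch of ChangeSet as described above; illustrative only."""
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    modified: list = field(default_factory=list)

    @property
    def has_changes(self) -> bool:
        return bool(self.added or self.removed or self.modified)

    def summary(self) -> str:
        parts = []
        if self.added:
            parts.append(f"+{len(self.added)} added")
        if self.removed:
            parts.append(f"-{len(self.removed)} removed")
        if self.modified:
            parts.append(f"~{len(self.modified)} modified")
        return ", ".join(parts) or "no changes"

print(ChangeSetSketch(added=["scph5500.bin"]).summary())  # +1 added
```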
### scraper_cli
`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:
| Flag | Behavior |
|------|----------|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |
When `--output` targets an existing file, the CLI preserves keys not generated by
the scraper (e.g. `data_directories`, manual additions). Only the `systems` section
is replaced. If the scraper defines `generate_platform_yaml()`, that method is used
instead of the generic YAML builder.
### Helper functions
Two additional functions in `base_scraper.py`:
- `fetch_github_latest_version(repo)` - fetches the latest release tag via GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching
an optional prefix.
## Creating a BIOS scraper
### Minimal example
Based on the RomM scraper pattern (JSON source, flat structure):
```python
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"
SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}

class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")
        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0

def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")

if __name__ == "__main__":
    main()
```
### Parsing different upstream formats
Each platform stores its BIOS requirements differently. The scraper's job is to
normalize them into `BiosRequirement` entries.
| Format | Example | Parsing approach |
|--------|---------|-----------------|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |
### System ID mapping
Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's
native system identifiers to retrobios system IDs. The retrobios system ID format
is `manufacturer-console` in lowercase with hyphens (e.g. `sony-playstation`,
`sega-mega-drive`, `nintendo-gba`).
When a native slug has no mapping, print a warning to stderr. This surfaces new
systems added upstream that need to be mapped.
System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in `common.py` maps common variations to canonical IDs.
### Hash normalization
- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields
are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution
layer handles prefix matching, but the scraper should store the hash as-is.
Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in `_shared.yml` (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.
## Creating a target scraper
Target scrapers determine which emulator cores are available on each hardware
target. They live in `scripts/scraper/targets/` and follow the same auto-discovery
pattern (`*_targets_scraper.py`).
### BaseTargetScraper ABC
```
BaseTargetScraper
__init__(url: str)
fetch_targets() -> dict # abstract: return target data
write_output(data, path) # write YAML to disk
```
### Output format
`fetch_targets()` must return a dict with this structure:
```python
{
"platform": "myplatform",
"source": "https://...",
"scraped_at": "2026-03-30T12:00:00Z",
"targets": {
"x86_64": {
"architecture": "x86_64",
"cores": ["beetle_psx", "dolphin", "snes9x"],
},
"rpi4": {
"architecture": "aarch64",
"cores": ["pcsx_rearmed", "mgba"],
},
},
}
```
The `targets` dict maps target names to their available cores. Core names must
match the names used in emulator profile `cores:` fields for the target filtering
pipeline to work correctly.
### Overrides
`platforms/targets/_overrides.yml` provides post-scrape adjustments:
- **aliases**: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- **add_cores**: cores present on a target but not detected by the scraper
- **remove_cores**: cores detected by the scraper but not actually functional
Overrides are applied by `load_target_config()` in `common.py` after loading
the scraped data. The scraper itself does not need to handle overrides.
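An illustrative sketch of the three override mechanisms; only the mechanism names (aliases, add_cores, remove_cores) come from the text — the exact key layout of `_overrides.yml` is an assumption here:

```yaml
# Illustrative only - the real _overrides.yml layout may differ.
aliases:
  arm64: aarch64      # alternate name -> canonical target
  rpi4: bcm2711
targets:
  bcm2711:
    add_cores: [mgba]         # works on this target but missed by the scraper
    remove_cores: [dolphin]   # scraped but not actually functional
```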
### Module contract
```python
from scripts.scraper.targets import BaseTargetScraper
PLATFORM_NAME = "myplatform_targets"
class Scraper(BaseTargetScraper):
def __init__(self):
super().__init__(url="https://...")
def fetch_targets(self) -> dict:
# Fetch and parse target data
...
```
Register the target scraper in `_registry.yml`:
```yaml
myplatform:
target_scraper: myplatform_targets
target_source: https://...
```
### Existing target scrapers
| Scraper | Source | Approach |
|---------|--------|----------|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |
## Parser modules
Shared parsers in `scripts/scraper/` handle formats used by multiple scrapers
or formats complex enough to warrant dedicated parsing logic.
### dat_parser
Parses clrmamepro DAT format as used in RetroArch's `System.dat`:
```
game (
name "System"
comment "Platform Name"
rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```
Produces `DatRom` dataclass instances with `name`, `size`, `crc32`, `md5`, `sha1`,
and `system` fields. The `libretro_scraper` uses this parser.
### mame_parser
Parses MAME C source files to extract BIOS root sets. Handles:
- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging
Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources
for each tagged MAME version.
### fbneo_parser
Parses FBNeo C source files:
- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag
Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.
### _hash_merge
Text-based YAML patching that merges fetched hash data into emulator profiles
while preserving formatting. Two strategies:
- **MAME**: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- **FBNeo**: updates individual ROM entries grouped by `archive` field
The merge preserves fields the hash data does not generate (system, note, required)
and leaves entries not present in the hash data untouched. Uses text-level YAML
manipulation rather than load-dump to maintain human-readable formatting.
## Testing
### Development workflow
1. **Start with `--dry-run`** to preview results before writing any output:
```bash
python -m scripts.scraper.myplatform_scraper --dry-run
```
2. **Check JSON output** for data quality:
```bash
python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
```
3. **Compare with existing YAML** if updating a scraper:
```bash
python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
diff platforms/myplatform.yml /tmp/test.yml
```
4. **Run verification** after generating:
```bash
python scripts/verify.py --platform myplatform
python scripts/verify.py --platform myplatform --verbose
```
5. **Run the full pipeline** before committing:
```bash
python scripts/pipeline.py --offline
```
### Round-trip testing
If an exporter exists for the platform, validate the scrape-export-compare cycle:
```bash
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml
# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json
# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
/tmp/exported.json
```
### Common issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | `_fetch_raw()` has 30s timeout; check URL |
| `Response exceeds 50 MB` | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |
### E2E tests
The project's test suite (`tests/test_e2e.py`) covers scraper integration at the
pipeline level. When adding a new scraper, verify that the full pipeline passes:
```bash
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```
Both must pass before the scraper is considered complete.

wiki/advanced-usage.md
# Advanced Usage
Fine-grained control over pack generation, hardware filtering, truth analysis, and verification.
## Custom Packs
### Build from hash
Look up a single MD5 in the database:
```bash
python scripts/generate_pack.py --from-md5 d8f1206299c48946e6ec5ef96d014eaa
```
Build a pack containing only files matching hashes from a list (one MD5 per line, `#` for comments):
```bash
python scripts/generate_pack.py --platform batocera --from-md5-file missing.txt
```
This is useful when a platform reports missing files and you want to generate a targeted pack
rather than re-downloading the full archive.
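Parsing the list-file format takes only a few lines; this helper is a hypothetical sketch of the accepted syntax (one MD5 per line, `#` comments, blank lines ignored — whether trailing inline comments are accepted is an assumption):

```python
def read_md5_list(text: str) -> list[str]:
    """Parse a --from-md5-file payload into lowercase MD5 strings."""
    hashes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hashes.append(line.lower())
    return hashes
```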
### Split packs
Generate one ZIP per system instead of a single monolithic pack:
```bash
python scripts/generate_pack.py --platform retroarch --split
```
Group the split ZIPs by manufacturer (Sony, Nintendo, Sega, etc.):
```bash
python scripts/generate_pack.py --platform retroarch --split --group-by manufacturer
```
### System-specific packs
Extract only the files for a single system within a platform:
```bash
python scripts/generate_pack.py --platform retroarch --system sony-playstation
```
### Required only
Exclude optional files from the pack:
```bash
python scripts/generate_pack.py --platform batocera --required-only
```
What counts as "required" depends on the platform YAML. For existence-mode platforms
(RetroArch), the distinction comes from the `.info` file's `required` field.
For MD5-mode platforms (Batocera), all declared files are treated as required unless
explicitly marked optional.
## Hardware Target Filtering
### What targets are
A target represents a hardware architecture where a platform runs. Each architecture
has a different set of available cores. For example, the RetroArch Switch target
has fewer cores than the x86_64 target because some cores are not ported to ARM.
Target data is scraped from upstream sources (buildbot nightly listings, board configs,
scriptmodules) and stored in `platforms/targets/<platform>.yml`.
### Usage
Filter packs or verification to only include systems reachable by cores available
on the target hardware:
```bash
python scripts/generate_pack.py --platform retroarch --target switch
python scripts/generate_pack.py --all --target x86_64
python scripts/verify.py --platform batocera --target rpi4
```
When combined with `--all`, platforms that define the target are filtered. Platforms
without a target file for that name are left unfiltered (no information to exclude anything).
Platforms that have target data but not the requested target are skipped with an INFO message.
### How it works
The filtering pipeline has three stages:
1. **`load_target_config()`** reads `platforms/targets/<platform>.yml` and returns
the set of cores available on the target. Aliases from `_overrides.yml` are resolved
(e.g., `--target rpi4` may match `bcm2711` in the target file).
2. **`resolve_platform_cores()`** determines which emulator profiles are relevant
for the platform, then intersects the result with the target's core set. The
intersection uses a reverse index built from each profile's `cores:` field, so
that upstream names (e.g., `mednafen_psx` on the buildbot) map to profile keys
(e.g., `beetle_psx`).
3. **`filter_systems_by_target()`** removes platform systems where every core that
emulates them is absent from the target. Systems with no core information are kept
(benefit of the doubt). System ID normalization strips manufacturer prefixes and
separators so that `xbox` matches `microsoft-xbox`.
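Stages 2 and 3 can be sketched as follows; `PROFILES`, the manufacturer list, and both helper names are illustrative, not the real `common.py` implementation:

```python
# Stage 2: reverse index from upstream core names to profile keys.
PROFILES = {  # profile key -> upstream names from its `cores:` field
    "beetle_psx": ["mednafen_psx", "beetle_psx"],
    "mgba": ["mgba"],
}

def reverse_core_index(profiles: dict[str, list[str]]) -> dict[str, str]:
    """Map each upstream core name back to the profile key that declares it."""
    return {name: key for key, names in profiles.items() for name in names}

# Stage 3: system ID normalization for matching.
def normalize_system_id(system_id: str) -> str:
    """Strip manufacturer prefix and separators so 'xbox' matches 'microsoft-xbox'."""
    s = system_id.lower().replace("_", "-")
    for maker in ("microsoft", "sony", "nintendo", "sega"):  # illustrative list
        prefix = maker + "-"
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    return s.replace("-", "")
```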
### List available targets
```bash
python scripts/verify.py --platform retroarch --list-targets
```
### Overrides
`platforms/targets/_overrides.yml` provides two mechanisms:
- **Aliases**: map user-facing names to internal target IDs
(e.g., `rpi4` -> `bcm2711`).
- **add/remove cores**: patch the scraped core list for a specific target
without overwriting the entire file. Useful when a core is known to work
but is not listed on the buildbot, or vice versa.
### Single-target platforms
Platforms with only one target (e.g., RetroBat with `windows`, RomM with `browser`)
treat `--target <their-only-target>` as a no-op: the output is identical to running
without `--target`.
## Truth Generation and Diffing
### What truth is
Truth data is ground truth generated from emulator profiles. It represents what each
core actually needs based on source code analysis, independent of what platform
scrapers declare. The purpose is gap analysis: finding files that platforms miss
or declare incorrectly.
### Generate truth
Build truth YAMLs from emulator profiles for a platform or all platforms:
```bash
python scripts/generate_truth.py --platform retroarch
python scripts/generate_truth.py --all --output-dir dist/truth/
```
Each truth YAML lists every system with its files, hashes, and the emulator profiles
that reference them. The output mirrors the platform YAML structure so the two can
be diffed directly.
### Diff truth vs scraped
Find divergences between generated truth and scraped platform data:
```bash
python scripts/diff_truth.py --platform retroarch
python scripts/diff_truth.py --all
```
The diff reports:
- Files present in truth but absent from the platform YAML (undeclared).
- Files present in the platform YAML but absent from truth (orphaned or from cores
not profiled yet).
- Hash mismatches between truth and platform data.
### Export to native formats
Convert truth data to the native format each platform consumes:
```bash
python scripts/export_native.py --platform batocera # Python dict (batocera-systems)
python scripts/export_native.py --platform recalbox # XML (es_bios.xml)
python scripts/export_native.py --all --output-dir dist/upstream/
```
This allows submitting corrections upstream in the format maintainers expect.
## Emulator-Level Verification
### Per-emulator checks
Verify files against a single emulator's ground truth (size, hashes, crypto):
```bash
python scripts/verify.py --emulator handy
python scripts/verify.py --emulator handy --verbose
```
Default output shows aggregate results per file: the core name and which checks apply.
With `--verbose`, each file expands to one line per core with the exact validation
parameters and source code reference:
```
lynxboot.img
handy validates size=512 crc32=0x0d973c9d [src/handy/system.h:45]
```
### Per-system checks
Aggregate verification across all cores that emulate a system:
```bash
python scripts/verify.py --system atari-lynx
```
### Standalone mode
Some cores have both libretro and standalone modes with different file requirements.
Filter to standalone-only:
```bash
python scripts/verify.py --emulator dolphin --standalone
```
### Ground truth in verbose output
The verbose report includes a coverage footer:
```
Ground truth: 142/160 files have emulator validation (88%)
```
This indicates how many files in the platform can be cross-checked against source-verified
emulator profiles. Files without ground truth rely solely on platform-level verification.
JSON output (`--json`) always includes the full per-emulator detail regardless of verbosity.
## Offline Workflow
### Full offline pipeline
Run the entire pipeline without network access:
```bash
python scripts/pipeline.py --offline
```
This skips data directory refresh, MAME/FBNeo hash fetch, and buildbot staleness checks.
All other steps (database generation, verification, pack building, consistency check,
README, site generation) run normally using cached data.
### Partial runs
Skip pack generation when you only need verification results:
```bash
python scripts/pipeline.py --offline --skip-packs
```
Skip documentation generation:
```bash
python scripts/pipeline.py --offline --skip-docs
```
### Truth pipeline
Include truth generation and diffing in the pipeline:
```bash
python scripts/pipeline.py --offline --with-truth
```
Include truth + native format export:
```bash
python scripts/pipeline.py --offline --with-export
```
### Combining flags
Flags compose freely:
```bash
python scripts/pipeline.py --offline --skip-docs --with-truth --target switch
```
This runs: database generation, verification (filtered to Switch cores), truth generation
and diff, consistency check. Packs and docs are skipped, no network access.

@@ -6,16 +6,22 @@
bios/ BIOS and firmware files, organized by Manufacturer/Console/
Manufacturer/Console/ canonical files (one per unique content)
.variants/ alternate versions (different hash, same purpose)
emulators/ one YAML profile per core/engine
platforms/ one YAML config per platform (scraped from upstream)
_shared.yml shared file groups across platforms
_registry.yml platform metadata (logos, scrapers, status, install config)
_data_dirs.yml data directory definitions (Dolphin Sys, PPSSPP...)
targets/ hardware target configs + _overrides.yml
scripts/ all tooling (Python, pyyaml only dependency)
scraper/ upstream scrapers (libretro, batocera, recalbox...)
scraper/targets/ hardware target scrapers (retroarch, batocera, emudeck, retropie)
exporter/ native format exporters (batocera, recalbox, emudeck...)
install/ JSON install manifests per platform
targets/ JSON target manifests per platform (cores per architecture)
data/ cached data directories (not BIOS, fetched at build)
schemas/ JSON schemas for validation
tests/ E2E test suite with synthetic fixtures
_mame_clones.json MAME parent/clone set mappings
dist/ generated packs (gitignored)
.cache/ hash cache and large file downloads (gitignored)
```
@@ -28,11 +34,38 @@ Upstream sources Scrapers parse generate_db.py scans
batocera-systems builds database.json
es_bios.xml (recalbox) (SHA1 primary key,
core-info .info files indexes: by_md5, by_name,
FirmwareDatabase.cs by_crc32, by_path_suffix)
MAME/FBNeo source
emulators/*.yml verify.py checks generate_pack.py resolves
source-verified platform-native files by hash, builds ZIP
from code verification packs per platform
truth.py generates diff_truth.py export_native.py
ground truth from compares truth vs exports to native formats
emulator profiles scraped platform (DAT, XML, JSON, Bash)
```
Pipeline runs all steps in sequence: DB, data dirs, MAME/FBNeo hashes,
verify, packs, install manifests, target manifests, consistency check,
README, site. See [tools](tools.md) for the full pipeline reference.
```mermaid
graph LR
A[generate_db] --> B[refresh_data_dirs]
B --> C[MAME/FBNeo hashes]
C --> D[verify --all]
D --> E[generate_pack --all]
E --> F[install manifests]
F --> G[target manifests]
G --> H[consistency check]
H --> I[generate_readme]
I --> J[generate_site]
style A fill:#2d333b,stroke:#adbac7,color:#adbac7
style D fill:#2d333b,stroke:#adbac7,color:#adbac7
style E fill:#2d333b,stroke:#adbac7,color:#adbac7
style J fill:#2d333b,stroke:#adbac7,color:#adbac7
```
## Three layers of data
@@ -46,12 +79,39 @@ emulators/*.yml verify.py checks generate_pack.py resolves
The pack combines platform baseline (layer 1) with core requirements (layer 3).
Neither too much (no files from unused cores) nor too few (no missing files for active cores).
The emulator's source code serves as ground truth for what files are needed,
what names they use, and what validation the emulator performs. Platform YAML
configs are scraped from upstream and are generally accurate, though they can
occasionally have gaps or stale entries. The emulator profiles complement the
platform data by documenting what the code actually loads. When the two disagree,
the profile takes precedence for pack generation: files the code needs are included
even if the platform does not declare them. Files the platform declares but no
profile references are kept as well (flagged during cross-reference), since the
upstream may cover cases not yet profiled.
```mermaid
graph TD
PY[Platform YAML<br/>scraped from upstream] --> PG[Pack generation]
EP[Emulator profiles<br/>source-verified] --> PG
SH[_shared.yml<br/>curated shared files] --> PY
SH --> EP
PG --> ZIP[ZIP pack per platform]
style PY fill:#2d333b,stroke:#adbac7,color:#adbac7
style EP fill:#2d333b,stroke:#adbac7,color:#adbac7
style SH fill:#2d333b,stroke:#adbac7,color:#adbac7
style PG fill:#2d333b,stroke:#adbac7,color:#adbac7
style ZIP fill:#2d333b,stroke:#adbac7,color:#adbac7
```
## Pack grouping
Platforms that produce identical packs are grouped automatically.
RetroArch and Lakka share the same files and `base_destination` (`system/`),
so they produce one combined pack (`RetroArch_Lakka_BIOS_Pack.zip`).
RetroPie uses `BIOS/` as base path, so it gets a separate pack.
With `--target`, the fingerprint includes target cores so platforms
with different hardware filters get separate packs.
## Storage tiers
@@ -99,6 +159,46 @@ If none exists, the platform version is kept.
| RPG Maker/ScummVM | excluded from dedup (NODEDUP) to preserve directory structure |
| `strip_components` in data dirs | flattens cache prefix to match expected path |
| case-insensitive dedup | prevents `font.rom` + `FONT.ROM` conflicts on Windows/macOS |
| frozen snapshot cores | `.info` may reflect current version while code is pinned to an old one. Only the frozen source at the pinned tag is reliable (e.g. desmume2015, mame2003) |
### File resolution chain
`resolve_local_file` in `common.py` tries each strategy in order, returning the
first match. Used by both `verify.py` and `generate_pack.py`.
```mermaid
graph TD
START([resolve_local_file]) --> S0{path_suffix<br/>exact match?}
S0 -- yes --> EXACT([exact])
S0 -- no --> S1{SHA1<br/>exact match?}
S1 -- yes --> EXACT
S1 -- no --> S2{MD5 direct<br/>or truncated?}
S2 -- yes --> MD5([md5_exact])
S2 -- no --> S3{name + aliases<br/>no MD5?}
S3 -- yes --> EXACT
S3 -- no --> S4{name + aliases<br/>md5_composite /<br/>direct MD5?}
S4 -- match --> EXACT
S4 -- name only --> HM([hash_mismatch])
S4 -- no --> S5{zippedFile<br/>inner ROM MD5?}
S5 -- yes --> ZE([zip_exact])
S5 -- no --> S6{MAME clone<br/>map lookup?}
S6 -- yes --> MC([mame_clone])
S6 -- no --> S7{data_dir<br/>cache scan?}
S7 -- yes --> DD([data_dir])
S7 -- no --> S8{agnostic<br/>fallback?}
S8 -- yes --> AG([agnostic_fallback])
S8 -- no --> NF([not_found])
style START fill:#2d333b,stroke:#adbac7,color:#adbac7
style EXACT fill:#2d333b,stroke:#adbac7,color:#adbac7
style MD5 fill:#2d333b,stroke:#adbac7,color:#adbac7
style HM fill:#2d333b,stroke:#adbac7,color:#adbac7
style ZE fill:#2d333b,stroke:#adbac7,color:#adbac7
style MC fill:#2d333b,stroke:#adbac7,color:#adbac7
style DD fill:#2d333b,stroke:#adbac7,color:#adbac7
style AG fill:#2d333b,stroke:#adbac7,color:#adbac7
style NF fill:#2d333b,stroke:#adbac7,color:#adbac7
```
## Platform inheritance
@@ -112,17 +212,36 @@ Core resolution (`resolve_platform_cores`) uses three strategies:
- `cores: [list]` - include only named profiles
- `cores:` absent - fallback to system ID intersection between platform and profiles
## Hardware target filtering
`--target TARGET` filters packs and verification by hardware (e.g. `switch`, `rpi4`, `x86_64`).
Target configs are in `platforms/targets/`. Overrides in `_overrides.yml` add aliases and
adjust core lists per target. `filter_systems_by_target` excludes systems whose cores are
not available on the target. Without `--target`, all systems are included.
## MAME clone map
`_mame_clones.json` at repo root maps MAME clone ROM names to their canonical parent.
When a clone ZIP was deduplicated, `resolve_local_file` uses this map to find the canonical file.
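Assuming a flat clone-to-parent JSON mapping (the actual `_mame_clones.json` schema may differ), the lookup is a sketch like:

```python
import json

# Hypothetical excerpt of _mame_clones.json: clone ROM name -> canonical parent.
CLONES_JSON = '{"mslug3a": "mslug3", "sf2ce": "sf2"}'

def canonical_rom_name(clone_name: str) -> str:
    """Resolve a clone ZIP name to the canonical parent kept on disk."""
    clone_map = json.loads(CLONES_JSON)
    return clone_map.get(clone_name, clone_name)  # parents map to themselves
```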
## Install manifests
`generate_pack.py --manifest` produces JSON manifests in `install/` for each platform.
These contain file lists with SHA1 hashes, platform detection config, and standalone copy
instructions. `install/targets/` contains per-architecture core availability.
The cross-platform installer (`install.py`) uses these manifests to auto-detect the
user's platform, filter files by hardware target, and download with SHA1 verification.
## Tests
4 test files with synthetic fixtures:
| File | Coverage |
|------|----------|
| `test_e2e.py` | file resolution, verification, severity, cross-reference, aliases, inheritance, shared groups, data dirs, storage tiers, HLE, launchers, platform grouping, core resolution, target filtering, truth/diff, exporters |
| `test_mame_parser.py` | BIOS root set detection, ROM block parsing, macro expansion |
| `test_fbneo_parser.py` | BIOS set detection, ROM info parsing |
| `test_hash_merge.py` | MAME/FBNeo YAML merge, diff detection |
```bash
python -m unittest tests.test_e2e -v
```
@@ -132,7 +251,8 @@ python -m unittest tests.test_e2e -v
| Workflow | File | Trigger | Role |
|----------|------|---------|------|
| Build & Release | `build.yml` | `workflow_dispatch` (manual) | restore large files, build packs, create GitHub release |
| Deploy Site | `deploy-site.yml` | push to main (platforms, emulators, wiki, scripts) + manual | generate site, build with MkDocs, deploy to GitHub Pages |
| PR Validation | `validate.yml` | pull request on `bios/`/`platforms/` | validate BIOS hashes, schema check, run tests, auto-label PR |
| Weekly Sync | `watch.yml` | cron (Monday 6 AM UTC) + manual | scrape upstream sources, detect changes, create update PR |

wiki/faq.md
# FAQ - RetroBIOS
## My game shows a black screen
Most likely a missing or incorrect BIOS file. Run verification for your platform:
```bash
python scripts/verify.py --platform retroarch
```
Look for MISSING or HASH MISMATCH entries. If a file shows HASH MISMATCH, you have a BIOS file but it's the wrong version or a bad dump. Replace it with one that matches the expected hash.
Some cores also support HLE (see below), so a missing BIOS may not always be the cause. Check the emulator's logs for error messages.
## What's the difference between required and optional?
**Required** means the emulator will not start games for that system without the file. **Optional** means the emulator works without it, but with reduced accuracy or missing features (e.g., boot screen animation, wrong font rendering, or degraded audio).
In verification output, missing required files appear as CRITICAL or WARNING depending on the platform. Missing optional files appear as WARNING or INFO.
## What's HLE?
HLE (High-Level Emulation) is a software reimplementation of what the original BIOS does. Some cores can boot games without a real BIOS file by using their built-in HLE fallback. The trade-off is lower accuracy: some games may have glitches or fail to boot entirely.
When a core has HLE support, the verification tool lowers the severity of a missing BIOS to INFO. The file is still included in packs because the real BIOS gives better results.
## Why are there multiple hashes for the same file?
Two main reasons:
1. **Regional variants.** The same filename (e.g., `IPL.bin` for GameCube) exists in different versions for USA, Europe, and Japan. Each region has a different hash.
2. **Revision differences.** Console manufacturers released updated BIOS versions over time. A PlayStation SCPH-5501 BIOS differs from a SCPH-7001.
Platforms that verify by MD5 accept specific hashes. If yours doesn't match any known hash, it may be a bad dump or an uncommon revision.
## How do I know which BIOS I need?
Two approaches:
1. **Run verify.py** for your platform. It lists every expected file with its hash and status.
2. **Check the project site.** Each platform page lists all required and optional BIOS files per system.
For a specific emulator core:
```bash
python scripts/verify.py --emulator beetle_psx --verbose
```
The `--verbose` flag shows source references and expected values from the emulator's source code.
## Is this legal?
Yes. Distribution of BIOS files, firmware, and encryption keys for emulation and preservation is supported by established case law and statutory exemptions across multiple jurisdictions.
### Emulation and BIOS redistribution
- **Emulation is legal.** *Sony v. Connectix* (2000) and *Sega v. Accolade* (1992) established that creating emulators and reverse-engineering console firmware for interoperability is lawful. BIOS files are functional prerequisites for this legal activity.
- **Fair use (US, 17 USC 107).** Non-commercial redistribution of firmware for personal emulation and archival is transformative use. The files serve a different purpose (interoperability) than the original (running proprietary hardware). No commercial market exists for standalone BIOS files.
- **Fair dealing (EU, UK, Canada, Australia).** Equivalent doctrines protect research, private study, and interoperability. The EU Software Directive (2009/24/EC, Art. 5-6) explicitly permits decompilation and use for interoperability.
- **Abandonware.** The vast majority of firmware here is for discontinued hardware no longer sold, supported, or distributed by the original manufacturer. No active commercial market is harmed.
### Encryption keys (Switch prod.keys, 3DS AES keys, Wii U keys)
This is the most contested area. The legal position:
- **Keys are not copyrightable.** Encryption keys are mathematical values, not creative expression. Copyright protects original works of authorship; a 256-bit number does not meet the threshold of originality. *Bernstein v. DOJ* (1996) established that code and algorithms are protected speech, and the mere publication of numeric values cannot be restricted under copyright.
- **DMCA 1201(f) interoperability exemption.** The DMCA prohibits circumvention of technological protection measures, but Section 1201(f) explicitly permits circumvention for the purpose of achieving interoperability between programs. Emulators require these keys to decrypt and run legally purchased game software. The keys enable interoperability, not piracy.
- **Library of Congress DMCA exemptions.** The triennial rulemaking process has repeatedly expanded exemptions for video game preservation. The 2024 exemption (37 CFR 201.40) covers circumvention for preservation of software and video games, including when the original hardware is no longer available.
- **Keys derived from consumer hardware.** These keys are extracted from retail hardware owned by consumers. Once a product is sold, the manufacturer cannot indefinitely control how the purchaser uses or examines their own property. *Chamberlain v. Skylink* (2004) held that using a product in a way the manufacturer dislikes is not automatically a DMCA violation.
- **No trade secret protection.** For keys to qualify as trade secrets, the holder must take reasonable steps to maintain secrecy. Keys embedded in millions of consumer devices and widely published online do not meet this standard.
### Recent firmware (Switch 19.0.0, PS3UPDAT, PSVUPDAT)
- **Firmware updates are freely distributed.** Nintendo, Sony, and other manufacturers distribute firmware updates via CDN without authentication or purchase requirements. Redistributing freely available data does not create new legal liability.
- **Functional necessity.** Emulators require system firmware to function. Providing firmware is equivalent to providing the operating environment the software was designed to run in.
- **Yuzu context.** The Yuzu settlement (2024) concerned the emulator itself and its facilitation of piracy, not the legality of firmware or key distribution. Yuzu settled without admitting liability and the case created no binding precedent against BIOS or key redistribution.
### Summary
This project distributes BIOS files, firmware, and encryption keys for personal use, archival, and interoperability with emulation software. The legal basis rests on fair use, statutory interoperability exemptions, preservation precedent, and the non-copyrightable nature of encryption keys.
## What's a hash/checksum?
A hash is a fixed-length fingerprint computed from a file's contents. If even one byte differs, the hash changes completely. The project uses three types:
| Type | Length | Example |
|------|--------|---------|
| MD5 | 32 hex chars | `924e392ed05558ffdb115408c263dccf` |
| SHA1 | 40 hex chars | `10155d8d6e6e832d8ea1571511e40dfb15fede05` |
| CRC32 | 8 hex chars | `2F468B96` |
Different platforms use different hash types for verification. Batocera uses MD5, RetroArch checks existence only, and RomM accepts any of the three.
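Computing all three fingerprints for a file needs only the standard library; this helper is a sketch, not part of the project's tooling:

```python
import hashlib
import zlib

def file_fingerprints(data: bytes) -> dict[str, str]:
    """Compute the three hash types used by the project for one file's bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        # CRC32 rendered as 8 uppercase hex chars, matching the table above.
        "crc32": f"{zlib.crc32(data) & 0xFFFFFFFF:08X}",
    }
```

In practice you would read the file in chunks (`hashlib` objects support incremental `update()`) rather than loading it whole.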
## Why does my verification report say UNTESTED?
UNTESTED means the file exists on disk but its hash was not confirmed against a known value. This happens on existence-mode platforms (RetroArch, Lakka, RetroPie) where the platform only checks that the file is present, without verifying its contents.
The file may still be correct. Running `verify.py --emulator <core> --verbose` shows the emulator-level ground truth, which can confirm whether the file's hash matches what the source code expects.
## Can I use BIOS from one platform on another?
Yes. BIOS files are console-specific, not platform-specific. A PlayStation BIOS works in RetroArch, Batocera, Recalbox, and any other platform that emulates PlayStation. The only differences between platforms are:
- **Where the file goes** (each platform has its own BIOS directory)
- **What filename is expected** (usually the same, occasionally different)
- **How verification works** (MD5 check vs. existence check)
The packs differ per platform because each platform declares its own set of supported systems and expected files.
## How often are packs updated?
A weekly automated sync checks upstream sources (libretro System.dat, batocera-systems, etc.) for changes. If differences are found, a pull request is created automatically. Manual releases happen as needed when new BIOS files are added or profiles are updated.

# Getting started - RetroBIOS
## What are BIOS files?
BIOS files are firmware dumps from original console hardware. Emulators need them to boot games for systems that relied on built-in software (PlayStation, Saturn, Dreamcast, etc.). Without the correct BIOS, the emulator either refuses to start the game or falls back to less accurate software emulation.
## Installation
Three ways to get BIOS files in place, from easiest to most manual.
### Option 1: install.py (recommended)
Self-contained Python script, no dependencies beyond Python 3.10+. Auto-detects your platform and BIOS directory.
```bash
python install.py
```
Override detection if needed:
```bash
python install.py --platform retroarch --dest ~/custom/bios
python install.py --check # verify existing files without downloading
python install.py --list-platforms # show supported platforms
```
The installer downloads files from GitHub releases, verifies SHA1 checksums, and places them in the correct directory.
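The download-verify-place flow can be sketched as follows. This is a hypothetical helper, not `install.py`'s actual code; `install_file`, `url`, and `expected_sha1` are illustrative names:

```python
import hashlib
import urllib.request
from pathlib import Path

def sha1_of(path: Path) -> str:
    """Stream the file so large BIOS images don't load fully into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def install_file(url: str, dest: Path, expected_sha1: str) -> None:
    tmp = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, tmp)   # download to a temp name
    if sha1_of(tmp) != expected_sha1:      # verify before placing
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {dest.name}")
    tmp.rename(dest)                       # only verified files land at dest
```

Verifying before renaming means a failed download never leaves a corrupt file at the destination path.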
### Option 2: download.sh (Linux/macOS)
One-liner for systems with `curl` or `wget`:
```bash
bash scripts/download.sh retroarch ~/RetroArch/system/
bash scripts/download.sh --list # show available packs
```
### Option 3: manual download
1. Go to the [releases page](https://github.com/Abdess/retrobios/releases)
2. Download the ZIP pack for your platform
3. Extract to the BIOS directory listed below
## BIOS directory by platform
### RetroArch
RetroArch uses the `system_directory` setting in `retroarch.cfg`. Default locations:
| OS | Default path |
|----|-------------|
| Windows | `%APPDATA%\RetroArch\system\` |
| Linux | `~/.config/retroarch/system/` |
| Linux (Flatpak) | `~/.var/app/org.libretro.RetroArch/config/retroarch/system/` |
| macOS | `~/Library/Application Support/RetroArch/system/` |
| Steam Deck | `~/.var/app/org.libretro.RetroArch/config/retroarch/system/` |
| Android | `/storage/emulated/0/RetroArch/system/` |
To check your actual path: open RetroArch, go to **Settings > Directory > System/BIOS**, or look for `system_directory` in `retroarch.cfg`.
### Batocera
```
/userdata/bios/
```
Accessible via network share at `\\BATOCERA\share\bios\` (Windows) or `smb://batocera/share/bios/` (macOS/Linux).
### Recalbox
```
/recalbox/share/bios/
```
Accessible via network share at `\\RECALBOX\share\bios\`.
### RetroBat
```
bios/
```
Relative to the RetroBat installation directory (e.g., `C:\RetroBat\bios\`).
### RetroDECK
```
~/.var/app/net.retrodeck.retrodeck/retrodeck/bios/
```
### EmuDeck
```
Emulation/bios/
```
Located inside your Emulation folder. On Steam Deck, typically `~/Emulation/bios/`.
### Lakka
```
/storage/system/
```
Accessible via SSH or Samba.
### RetroPie
```
~/RetroPie/BIOS/
```
### BizHawk
```
Firmware/
```
Relative to the BizHawk installation directory.
### RomM
BIOS files are managed through the RomM web interface. Check the
[RomM documentation](https://github.com/rommapp/romm) for setup details.
## Verifying your setup
After placing BIOS files, verify that everything is correct:
```bash
python scripts/verify.py --platform retroarch
python scripts/verify.py --platform batocera
python scripts/verify.py --platform recalbox
```
The output shows each expected file with its status: OK, MISSING, or HASH MISMATCH. Platforms that verify by MD5 (Batocera, Recalbox, EmuDeck) will catch wrong versions. RetroArch only checks that files exist.
For a single system:
```bash
python scripts/verify.py --system sony-playstation
```
For a single emulator core:
```bash
python scripts/verify.py --emulator beetle_psx
```
See [Tools](tools.md) for the full CLI reference.
## Next steps
- [FAQ](faq.md) - common questions and troubleshooting
- [Tools](tools.md) - all available scripts and options
- [Architecture](architecture.md) - how the project works internally

Technical documentation for the RetroBIOS toolchain.
## For users
- **[Getting started](getting-started.md)** - installation, BIOS directory paths per platform, verification
- **[FAQ](faq.md)** - common questions, troubleshooting, hash explanations
If you just want to download BIOS packs, see the [home page](../index.md).
## Technical reference
- **[Architecture](architecture.md)** - directory structure, data flow, platform inheritance, pack grouping, security, edge cases, CI workflows
- **[Tools](tools.md)** - CLI reference for every script, pipeline usage, scrapers
- **[Advanced usage](advanced-usage.md)** - custom packs, target filtering, truth generation, emulator verification, offline workflow
- **[Verification modes](verification-modes.md)** - how each platform verifies BIOS files, severity matrix, resolution chain
- **[Data model](data-model.md)** - database.json structure, indexes, file resolution order, YAML formats
- **[Troubleshooting](troubleshooting.md)** - diagnosis by symptom: missing BIOS, hash mismatch, pack issues, verify errors
## For contributors
- **[Profiling guide](profiling.md)** - create an emulator profile from source code, YAML field reference
- **[Adding a platform](adding-a-platform.md)** - scraper, registry, YAML config, exporter, target scraper, install detection
- **[Adding a scraper](adding-a-scraper.md)** - plugin architecture, BaseScraper, parsers, target scrapers
- **[Testing guide](testing-guide.md)** - run tests, fixture pattern, how to add tests, CI integration
- **[Release process](release-process.md)** - CI workflows, large files, manual release
See [contributing](../contributing.md) for submission guidelines.
## Glossary
- **BIOS** - firmware burned into console hardware, needed by emulators that rely on original boot code
- **firmware** - system software loaded by a console at boot; used interchangeably with BIOS in this project
- **HLE** - High-Level Emulation; software reimplementation of BIOS functions, avoids needing the original file
- **hash** - fixed-length fingerprint of a file's contents; this project uses MD5, SHA1, SHA256, and CRC32
- **platform** - a distribution that packages emulators (RetroArch, Batocera, Recalbox, EmuDeck, etc.)
- **core** - an emulator packaged as a libretro plugin, loaded by RetroArch or compatible frontends
- **profile** - a YAML file in `emulators/` documenting one core's BIOS requirements, verified against source code
- **system** - a game console or computer being emulated (e.g. sony-playstation, nintendo-gameboy-advance)
- **pack** - a ZIP archive containing all BIOS files needed by a specific platform
- **ground truth** - the emulator's source code, treated as the authoritative reference for BIOS requirements
- **cross-reference** - comparison of emulator profiles against platform configs to find undeclared files
- **scraper** - a script that fetches BIOS requirement data from an upstream source (System.dat, es_bios.xml, etc.)
- **exporter** - a script that converts ground truth data back into a platform's native format
- **target** - a hardware architecture that a platform runs on (e.g. switch, rpi4, x86_64, steamos)
- **variant** - an alternative version of a BIOS file (different revision, region, or dump), stored in `.variants/`
- **required** - a file the core needs to function; determined by source code behavior
- **optional** - a file the core functions without, possibly with reduced accuracy or missing features
- **hle_fallback** - flag on a file indicating the core has an HLE path; absence is downgraded to INFO severity
- **severity** - the urgency of a verification result: OK (verified), INFO (negligible), WARNING (degraded), CRITICAL (broken)

The source code is the reference because it reflects actual behavior.
Documentation, .info files, and wikis are useful starting points
but are verified against the code.
### Source hierarchy
Documentation and metadata are valuable starting points, but they can
fall out of sync with the actual code over time. The desmume2015 .info
file is a good illustration: it declares `firmware_count=3`, but the
source code at the pinned version opens zero firmware files. Cross-checking
against the source helps catch that kind of gap early.
When sources conflict, priority follows the chain of actual execution:
1. **Original emulator source** (ground truth, what the code actually does)
2. **Libretro port** (may adapt paths, add compatibility shims, or drop features)
3. **.info metadata** (declarative, may be outdated or copied from another core)
For standalone emulators like BizHawk or amiberry, there is only one
level. The emulator's own codebase is the single source of truth. No
.info, no wrapper, no divergence to track.
A note on libretro port differences: the most common change is path
resolution. The upstream emulator loads files from the current working
directory; the libretro wrapper redirects to `retro_system_directory`.
This is normal adaptation, not a divergence worth documenting. Similarly,
filename changes like `naomi2_eeprom.bin` becoming `n2_eeprom.bin` are
often deliberate. RetroArch uses a single shared system directory for
all cores, so the port renames files to prevent collisions between cores
that emulate different systems but happen to use the same generic
filenames. The upstream name goes in `aliases:`.
## Steps
### 1. Find the source code
Always clone both upstream and libretro port to compare.
For libretro cores, cloning both repositories and diffing them reveals
what the port changed. Path changes (fopen of a relative path becoming
a system_dir lookup) are expected. What matters are file additions the
port introduces, files the port dropped, or hash values that differ
between the two codebases.
If the source is hosted outside GitHub, keep looking: emulator source turns up on GitLab, Codeberg, SourceForge, Bitbucket, in archive.org snapshots, and in community mirror tarballs. Inspecting copyright headers or license strings in the libretro fork often points to the original author's site. The upstream code exists somewhere; it's worth continuing the search before concluding the source is unavailable.
One thing worth noting: even when the same repository was analyzed for
a related profile (e.g., fbneo for arcade systems), it helps to do a
fresh pass for each new profile. When fbneo_neogeo was profiled, the
NeoGeo subset referenced BIOS files that the main arcade analysis
hadn't encountered. A fresh look avoids carrying over blind spots.
### 2. Trace file loading
Read the code flow, tracing from the entry point.
Each emulator has its own way of loading files.
Look for:
- Hash validation (MD5, CRC32, SHA1 comparisons in code)
- Size validation (`fseek`/`ftell`, `stat`, fixed buffer sizes)
Grepping for "bios" or "firmware" across the source tree can be a
useful first pass, but it may miss emulators that use different terms
(bootrom, system ROM, IPL, program.rom) and can surface false matches
from test fixtures or comments.
A more reliable approach is starting from the entry point
(`retro_load_game` for libretro, `main()` for standalone) and tracing
the actual file-open calls forward. Each emulator has its own loading
flow. Dolphin loads region-specific IPL files through a boot sequence
object. BlastEm reads a list of ROM paths from a configuration
structure. same_cdi opens CD-i BIOS files through a machine
initialization routine. The loading flow varies widely between emulators.
### 3. Determine required vs optional
This is decided by code behavior, not by judgment:
This is decided by code behavior, not by judgment:
- **required**: the core refuses to boot or crashes without it
- **optional**: the core works with degraded functionality without it
- **hle_fallback: true**: the core has a high-level emulation path when the file is missing
The decision is based on the code's behavior. If the core crashes or
refuses to boot without the file, it is required. If it continues with
degraded functionality (missing boot animation, different fonts, reduced
audio in menus), it is optional. This keeps the classification objective
and consistent across all profiles.
When a core has HLE (high-level emulation), the real BIOS typically
gives better accuracy, but the core functions without it. These files
are marked with `hle_fallback: true` and `required: false`. The file
still ships in packs (better experience for the user), but its absence
does not raise alarms during verification.
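The resulting classification can be expressed as a small decision function. This is an illustrative sketch of the rule, not `verify.py`'s actual implementation:

```python
def missing_file_severity(required: bool, hle_fallback: bool = False) -> str:
    """Severity when an expected BIOS file is absent."""
    if hle_fallback:
        return "INFO"      # core has an HLE path; absence is negligible
    if required:
        return "CRITICAL"  # core crashes or refuses to boot
    return "WARNING"       # core runs with degraded functionality

assert missing_file_severity(required=True) == "CRITICAL"
assert missing_file_severity(required=False) == "WARNING"
assert missing_file_severity(required=False, hle_fallback=True) == "INFO"
```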
### 4. Document divergences
When the libretro port differs from the upstream:
Path differences (current dir vs system_dir) are normal adaptation,
not a divergence. Name changes (e.g. `naomi2_` to `n2_`) may be intentional
to avoid conflicts in the shared system directory.
RetroArch's system directory is shared by every installed core. When
the libretro port renames a file, it is usually solving a real problem:
two cores that both expect `bios.rom` would overwrite each other. The
upstream name goes in `aliases:` and `mode: libretro` on the port-specific
name, so both names are indexed.
True divergences worth documenting are: files the port adds that the
upstream never loads, files the upstream loads that the port dropped
(a gap in the port), and hash differences in embedded ROM data between
the two codebases. These get noted in the profile because they affect
what the user actually needs to provide.
### 5. Write the YAML profile
```yaml
files:
  - source_ref: Source/Core/Core/Boot/Boot_BS2Emu.cpp:42
```
### Writing style
Notes in a profile describe what the core does, focused on what files get loaded, how, and from where. Comparisons with other cores, disclaimers, and feature coverage beyond file requirements belong in external documentation. The profile is a technical spec.
Profiles are standalone documentation. Someone should be able to take
a single YAML file and integrate it into their own project without
knowing anything about this repository's database, directory layout,
or naming conventions. The YAML documents what the emulator expects.
The tooling resolves the YAML against the local file collection
separately.
A few field conventions that protect the toolchain:
- `type:` is operational. `resolve_platform_cores()` uses it to filter
which profiles apply to a platform. Valid values are `libretro`,
`standalone + libretro`, `standalone`, `alias`, `launcher`, `game`,
`utility`, `test`. Putting a classification concept here (like
"bizhawk-native") breaks the filtering. A BizHawk core is
`type: standalone`.
- `core_classification:` is descriptive. It documents the relationship
between the core and the original emulator (pure_libretro,
official_port, community_fork, frozen_snapshot, etc.). It has no
effect on tooling behavior.
- Alternative filenames go in `aliases:` on the file entry (rather than
as separate entries in platform YAMLs or `_shared.yml`). When the same
physical ROM is known by three names across different platforms, one
name is `name:` and the rest are `aliases:`.
- Hashes come from source code. If the source has a hardcoded hex
string (like emuscv's `635a978...` in memory.cpp), that goes in. If
the source embeds ROM data as byte arrays (like ep128emu's roms.hpp),
the bytes can be extracted and hashed. If the source performs no hash
check at all, the hash is omitted from the profile. The .info or docs
may list an MD5, but source confirmation makes it more reliable.
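Extracting a hash from embedded ROM data is mechanical: reassemble the byte array from the source into bytes and hash it. A generic sketch; the C array literal here is invented for illustration:

```python
import hashlib
import re

# Hypothetical C source embedding ROM data as a byte array
c_source = "static const uint8_t rom[] = { 0x31, 0x32, 0x33 };"

# Pull out the hex byte literals and hash the reassembled data
data = bytes(int(tok, 16) for tok in re.findall(r"0x([0-9A-Fa-f]{2})", c_source))
print(hashlib.sha1(data).hexdigest())
```

The resulting hash goes into the profile only because it was derived from the source itself, which is what makes it more reliable than a value copied from a .info file.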
### 6. Validate
```bash
python scripts/cross_reference.py --emulator dolphin --json
python scripts/verify.py --emulator dolphin
```
### Lessons learned
These are patterns that have come up while building profiles. Sharing
them here in case they save time.
**.info metadata can lag behind the code.** The desmume2015 .info
declares `firmware_count=3`, but the core source at the pinned version
never opens any firmware file. The .info is useful as a starting point
but benefits from a cross-check against the actual code.
**Fresh analysis per profile helps.** When fbneo was profiled for
arcade systems, NeoGeo-specific BIOS files were outside the analysis
scope. Profiling fbneo_neogeo later surfaced files the first pass
hadn't covered. Doing a fresh pass for each profile, even on a
familiar codebase, avoids carrying over blind spots.
**Path adaptation vs real divergence.** The libretro wrapper changing
`fopen("./rom.bin")` to load from `system_dir` is the standard
porting pattern. The file is the same; only the directory resolution
changed. True divergences (added/removed files, different embedded
data) are the ones worth documenting.
**Each core has its own loading logic.** snes9x and bsnes both
emulate the Super Nintendo, but they handle the Super Game Boy BIOS
and DSP firmware through different code paths. Checking the actual
code for each core avoids assumptions based on a related profile.
**Code over docs.** Wiki pages and README files sometimes reference
files from older versions or a different fork. If the source code
does not load a particular file, it can be left out of the profile
even if documentation mentions it.
## YAML field reference
### Profile fields
| Field | Required | Description |
|-------|----------|-------------|
| `emulator` | yes | display name |
| `type` | yes | `libretro`, `standalone`, `standalone + libretro`, `alias`, `launcher`, `game`, `utility`, `test` |
| `core_classification` | no | `pure_libretro`, `official_port`, `community_fork`, `frozen_snapshot`, `enhanced_fork`, `game_engine`, `embedded_hle`, `alias`, `launcher` |
| `source` | yes | libretro core repository URL |
| `upstream` | no | original emulator repository URL |
| `profiled_date` | yes | date of source analysis |
| `core_version` | yes | version analyzed |
| `display_name` | no | full display name (e.g. "Sega - Mega Drive (BlastEm)") |
| `systems` | yes | list of system IDs this core handles |
| `cores` | no | list of upstream core names for buildbot/target matching |
| `mode` | no | default mode: `standalone`, `libretro`, or `both` |
| `verification` | no | how the core verifies BIOS: `existence` or `md5` |
| `files` | yes | list of file entries |
| `notes` | no | free-form technical notes |
| `data_directories` | no | references to data dirs in `_data_dirs.yml` |
| `exclusion_note` | no | why the profile has no files despite .info declaring firmware |
| `analysis` | no | structured per-subsystem analysis (capabilities, supported modes) |
| `platform_details` | no | per-system platform-specific details (paths, romsets, forced systems) |
### File entry fields
|-------|-------------|
| `name` | filename as the core expects it |
| `required` | true if the core needs this file to function |
| `system` | system ID this file belongs to (for multi-system profiles) |
| `size` | expected size in bytes |
| `min_size`, `max_size` | size range when the code accepts a range |
| `md5`, `sha1`, `crc32`, `sha256` | expected hashes from source code |
| `validation` | checks the code performs: `size`, `crc32`, `md5`, `sha1`, `adler32`, `signature`, `crypto`. Can be a list or dict `{core: [...], upstream: [...]}` for divergent checks |
| `aliases` | alternate filenames for the same file |
| `mode` | `libretro`, `standalone`, or `both` |
| `hle_fallback` | true if a high-level emulation path exists |
| `category` | `bios` (default), `game_data`, `bios_zip` |
| `region` | geographic region (e.g. `north-america`, `japan`) |
| `source_ref` | source file and line number (e.g. `boot.cpp:42`) |
| `path` | destination path relative to system directory |
| `description` | what this file is |
| `note` | additional context |
| `archive` | parent ZIP if this file is inside an archive |
| `contents` | structure of files inside a BIOS ZIP (`name`, `description`, `size`, `crc32`) |
| `storage` | `large_file` for files > 50 MB stored as release assets |

# Release Process
This page documents the CI/CD pipeline: what each workflow does, how releases
are built, and how to run the process manually.
## CI workflows overview
The project uses 4 GitHub Actions workflows. All use only official GitHub
actions (`actions/checkout`, `actions/setup-python`, `actions/upload-pages-artifact`,
`actions/deploy-pages`). No third-party actions.
Budget target: ~175 minutes/month on the GitHub free tier.
| Workflow | File | Trigger |
|----------|------|---------|
| Build & Release | `build.yml` | Push to `bios/**` or `platforms/**`, manual dispatch |
| Deploy Site | `deploy-site.yml` | Push to main (platforms, emulators, wiki, scripts, database.json, mkdocs.yml), manual |
| PR Validation | `validate.yml` | PR touching `bios/**` or `platforms/**` |
| Weekly Sync | `watch.yml` | Cron Monday 06:00 UTC, manual dispatch |
## build.yml - Build & Release
Currently disabled (`if: false` on the release job) until pack generation is
validated in production.
**Trigger.** Push to `main` on `bios/**` or `platforms/**` paths, or manual
`workflow_dispatch` with optional `force_release` flag to bypass rate limiting.
**Concurrency.** Group `build`, cancel in-progress.
**Steps:**
1. Checkout, Python 3.12, install `pyyaml`
2. Run `test_e2e`
3. Rate limit check: skip if last release was less than 7 days ago (unless
`force_release` is set)
4. Restore large files from the `large-files` release into `.cache/large/`
5. Refresh data directories (`refresh_data_dirs.py`)
6. Build packs (`generate_pack.py --all --output-dir dist/`)
7. Create GitHub release with tag `v{YYYY.MM.DD}` (appends `.N` suffix if
a same-day release already exists)
8. Clean up old releases, keeping the 3 most recent plus `large-files`
**Release notes** include file count, total size, per-pack sizes, and the last
15 non-merge commits touching `bios/` or `platforms/`.
## deploy-site.yml - Deploy Documentation Site
**Trigger.** Push to `main` when any of these paths change: `platforms/`,
`emulators/`, `wiki/`, `scripts/generate_site.py`, `scripts/generate_readme.py`,
`scripts/verify.py`, `scripts/common.py`, `database.json`, `mkdocs.yml`.
Also manual dispatch.
**Steps:**
1. Checkout, Python 3.12
2. Install `pyyaml`, `mkdocs-material`, `pymdown-extensions`
3. Run `generate_site.py` (converts YAML data into MkDocs pages)
4. Run `generate_readme.py` (rebuilds README.md and CONTRIBUTING.md)
5. `mkdocs build` to produce the static site
6. Upload artifact, deploy to GitHub Pages
The site is deployed via the `github-pages` environment using the official
`actions/deploy-pages` action.
## validate.yml - PR Validation
**Trigger.** Pull requests that modify `bios/**` or `platforms/**`.
**Concurrency.** Per-PR group, cancel in-progress.
Four parallel jobs:
**validate-bios.** Diffs the PR to find changed BIOS files, runs
`validate_pr.py --markdown` on each, and posts the validation report as a PR
comment (hash verification, database match status).
**validate-configs.** Validates all platform YAML files against
`schemas/platform.schema.json` using `jsonschema`. Fails if any config does
not match the schema.
**run-tests.** Runs `python -m unittest tests.test_e2e -v`. Must pass before
merge.
**label-pr.** Auto-labels the PR based on changed paths:
| Path pattern | Label |
|-------------|-------|
| `bios/` | `bios` |
| `bios/{Manufacturer}/` | `system:{manufacturer}` |
| `platforms/` | `platform-config` |
| `scripts/` | `automation` |
## watch.yml - Weekly Platform Sync
**Trigger.** Cron schedule every Monday at 06:00 UTC, or manual dispatch.
**Flow:**
1. Scrape live upstream sources (System.dat, batocera-systems, es_bios.xml,
etc.) and regenerate platform YAML configs
2. Auto-fetch missing BIOS files
3. Refresh data directories
4. Run dedup
5. Regenerate `database.json`
6. Create or update a PR with labels `automated` and `platform-update`
The PR contains all changes from the scrape cycle. A maintainer reviews and
merges.
## Large files management
Files larger than 50 MB are stored as assets on a permanent GitHub release
named `large-files` (to keep the git repository lightweight).
Known large files: PS3UPDAT.PUP, PSVUPDAT.PUP, PSP2UPDAT.PUP, dsi_nand.bin,
maclc3.zip, Firmware.19.0.0.zip (Switch).
**Storage.** Listed in `.gitignore` so they stay out of git history. The
`large-files` release is excluded from cleanup (the build workflow only
deletes version-tagged releases).
**Build-time restore.** The build workflow downloads all assets from
`large-files` into `.cache/large/` and copies them to their expected paths
before pack generation.
**Upload.** To add or update a large file:
```bash
gh release upload large-files "bios/Sony/PS3/PS3UPDAT.PUP#PS3UPDAT.PUP"
```
**Local cache.** `generate_pack.py` calls `fetch_large_file()` which downloads
from the release and caches in `.cache/large/` for subsequent runs.
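The cache-then-download pattern looks roughly like this. A hypothetical helper mirroring the behavior described above, not the actual `fetch_large_file` code:

```python
import urllib.request
from pathlib import Path

CACHE_DIR = Path(".cache/large")

def fetch_cached(name: str, url: str) -> Path:
    """Return a local path for a large release asset, downloading at most once."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / name
    if not cached.exists():                    # cache miss: download
        urllib.request.urlretrieve(url, cached)
    return cached                              # cache hit: reuse local copy
```

Subsequent runs hit the cache, so repeated pack builds don't re-download multi-hundred-megabyte firmware files.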
## Manual release process
When `build.yml` is disabled, build and release manually:
```bash
# Run the full pipeline (DB + verify + packs + consistency check)
python scripts/pipeline.py --offline
# Or step by step:
python scripts/generate_db.py --force --bios-dir bios --output database.json
python scripts/verify.py --all
python scripts/generate_pack.py --all --output-dir dist/
# Create the release
DATE=$(date +%Y.%m.%d)
gh release create "v${DATE}" dist/*.zip \
--title "BIOS Pack v${DATE}" \
--notes "Release notes here" \
--latest
```
To re-enable automated releases, remove the `if: false` guard from the
`release` job in `build.yml`.

# Testing Guide
This page covers how to run, understand, and extend the test suite.
All tests use synthetic fixtures. No real BIOS files, platform configs, or
network access required.
## Running tests
Run a single test module:
```bash
python -m unittest tests.test_e2e -v
python -m unittest tests.test_mame_parser -v
python -m unittest tests.test_fbneo_parser -v
python -m unittest tests.test_hash_merge -v
```
Run the full suite:
```bash
python -m unittest discover tests -v
```
The only dependency is `pyyaml`. No test framework beyond the standard
library `unittest` module.
## Test architecture
### test_e2e.py
The main regression suite. A single `TestE2E` class exercises every code path
through the resolution, verification, pack generation, and cross-reference
logic.
**Fixture pattern.** `setUp` creates a temporary directory tree with:
- Fake BIOS files (deterministic content for hash computation)
- Platform YAML configs (existence mode, MD5 mode, inheritance, shared groups)
- Emulator profile YAMLs (required/optional files, aliases, HLE, standalone)
- A synthetic `database.json` keyed by SHA1
`tearDown` removes the temporary tree.
**Test numbering.** Tests are grouped by category:
| Range | Category |
|-------|----------|
| `test_01`--`test_14` | File resolution (SHA1, MD5, name, alias, truncated MD5, composite, zip contents, variants, hash mismatch) |
| `test_20`--`test_31` | Verification (existence mode, MD5 mode, required/optional severity, zipped file, multi-hash) |
| `test_40`--`test_47` | Cross-reference (undeclared files, standalone skip, alias profiles, data dir suppression, exclusion notes) |
| `test_50`+ | Platform config (inheritance, shared groups, data directories, grouping, core resolution, target filtering, ground truth) |
Each test calls the same functions that `verify.py` and `generate_pack.py` use
in production, against the synthetic fixtures.
### Parser tests
**test_mame_parser.** Tests the MAME C source parser that extracts BIOS root
sets from driver files. Fixtures are inline C source snippets containing
`ROM_START`, `ROM_LOAD`, `GAME()`/`COMP()` macros with
`MACHINE_IS_BIOS_ROOT`. Tests cover:
- Standard `GAME` macro detection
- `COMP` macro detection
- `ROM_LOAD` / `ROMX_LOAD` parsing (name, size, CRC32, SHA1)
- `ROM_SYSTEM_BIOS` variant extraction
- Multi-region ROM blocks
- Macro expansion and edge cases
**test_fbneo_parser.** Tests the FBNeo C source parser that identifies
`BDF_BOARDROM` sets. Same inline fixture approach.
**test_hash_merge.** Tests the text-based YAML patching module used to merge
upstream BIOS hashes into emulator profiles. Covers:
- Merge operations (add new hashes, update existing)
- Diff computation (detect what changed)
- Formatting preservation (comments, ordering, flow style)
Fixtures are programmatically generated YAML/JSON files written to a temp
directory.
## How to add a test
1. **Pick the right category.** Find the number range that matches the
subsystem you are testing. If none fits, start a new range after the last
existing one.
2. **Create synthetic fixtures.** Write the minimum YAML configs and fake
files needed to isolate the behavior. Use `tempfile.mkdtemp` for a clean
workspace. Avoid depending on the repo's real `bios/` or `platforms/`
directories.
3. **Call production functions.** Import from `common`, `verify`, `validation`,
or `truth` and call the same entry points that the CLI scripts use. Do not
re-implement logic in tests.
4. **Assert specific outcomes.** Check `Status`, `Severity`, resolution
method, file counts, or pack contents. Avoid brittle assertions on log
output or formatting.
5. **Run the full suite.** After adding your test, run `python -m unittest
discover tests -v` to verify nothing else broke.
Example skeleton:
```python
def test_42_my_new_behavior(self):
# Write minimal fixtures to self.root
profile = {"emulator": "test_core", "files": [...]}
with open(os.path.join(self.emulators_dir, "test_core.yml"), "w") as f:
yaml.dump(profile, f)
# Call production code
result = verify_platform(self.config, self.db, ...)
# Assert specific outcomes
self.assertEqual(result[0]["status"], Status.OK)
```
## Verification discipline
The test suite is one layer of verification. The full quality gate is:
1. All unit tests pass (`python -m unittest discover tests`)
2. The full pipeline completes without error (`python scripts/pipeline.py --offline`)
3. No unexpected CRITICAL entries in the verify output
4. Pack file counts match verification file counts (consistency check)
If a change passes tests but breaks the pipeline, it's worth investigating before merging. Similarly, new CRITICAL entries in the verify output after a change usually indicate something to look into. The pipeline is designed so that all steps agree: if verify reports N files for a platform, the pack should contain exactly N files.
Ideally, tests, code, and documentation ship together. When profiles and platform configs are involved, updating them in the same change helps keep everything in sync.
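The pack-vs-verify consistency gate reduces to a set comparison. An illustrative sketch, assuming a list of verify result dicts and a pack ZIP; names are invented:

```python
import zipfile
from pathlib import Path

def check_consistency(verify_results: list[dict], pack_path: Path) -> None:
    """Fail if the pack does not contain exactly the files verify expected."""
    expected = {r["name"] for r in verify_results if r["status"] != "MISSING"}
    with zipfile.ZipFile(pack_path) as z:
        packed = {i.filename for i in z.infolist() if not i.is_dir()}
    if packed != expected:
        # symmetric difference shows both extra and missing entries
        raise ValueError(f"pack/verify drift: {sorted(packed ^ expected)}")
```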
## CI integration
The `validate.yml` workflow runs `test_e2e` on every pull request that touches
`bios/` or `platforms/` files. The test job (`run-tests`) runs in parallel
with BIOS validation, schema validation, and auto-labeling.
Tests must pass before merge. If a test fails in CI, reproduce locally with:
```bash
python -m unittest tests.test_e2e -v 2>&1 | head -50
```
The `build.yml` workflow also runs the test suite before building release
packs.

All tools are Python scripts in `scripts/`. Single dependency: `pyyaml`.
Run everything in sequence:
```bash
python scripts/pipeline.py --offline # DB + verify + packs + manifests + readme + site
python scripts/pipeline.py --offline --skip-packs # DB + verify only
python scripts/pipeline.py --offline --skip-docs # skip readme + site generation
python scripts/pipeline.py --offline --target switch # filter by hardware target
python scripts/pipeline.py --offline --with-truth # include truth generation + diff
python scripts/pipeline.py --offline --with-export # include native format export
python scripts/pipeline.py --check-buildbot # check buildbot data freshness
```
Pipeline steps:
| Step | Description | Skipped by |
|------|-------------|------------|
| 1/9 | Generate database | - |
| 2/9 | Refresh data directories | `--offline` |
| 2a | Refresh MAME BIOS hashes | `--offline` |
| 2a2 | Refresh FBNeo BIOS hashes | `--offline` |
| 2b | Check buildbot staleness | only with `--check-buildbot` |
| 2c | Generate truth YAMLs | only with `--with-truth` / `--with-export` |
| 2d | Diff truth vs scraped | only with `--with-truth` / `--with-export` |
| 2e | Export native formats | only with `--with-export` |
| 3/9 | Verify all platforms | - |
| 4/9 | Generate packs | `--skip-packs` |
| 4b | Generate install manifests | `--skip-packs` |
| 4c | Generate target manifests | `--skip-packs` |
| 5/9 | Consistency check | if verify or pack skipped |
| 8/9 | Generate README | `--skip-docs` |
| 9/9 | Generate site | `--skip-docs` |
## Individual tools
### generate_db.py
```bash
python scripts/generate_db.py --force --bios-dir bios --output database.json
```
Check BIOS coverage for each platform using its native verification mode.
```bash
python scripts/verify.py --all # all platforms
python scripts/verify.py --platform batocera # single platform
python scripts/verify.py --platform retroarch --verbose # with ground truth details
python scripts/verify.py --emulator dolphin # single emulator
python scripts/verify.py --emulator dolphin --standalone # standalone mode only
python scripts/verify.py --system atari-lynx # single system
python scripts/verify.py --platform retroarch --target switch # filter by hardware
python scripts/verify.py --list-emulators # list all emulators
python scripts/verify.py --list-systems # list all systems
python scripts/verify.py --platform retroarch --list-targets # list available targets
```
Verification modes per platform:
| Platform | Mode | Method |
|----------|------|--------|
| EmuDeck | md5 | MD5 whitelist per system |
| RetroDECK | md5 | MD5 per file via component manifests |
| RomM | md5 | size + any hash (MD5/SHA1/CRC32) |
| BizHawk | sha1 | SHA1 per firmware from FirmwareDatabase.cs |
### generate_pack.py
```bash
python scripts/generate_pack.py --platform retroarch --split --group-by manufacturer
python scripts/generate_pack.py --from-md5 d8f1206299c48946e6ec5ef96d014eaa
python scripts/generate_pack.py --platform batocera --from-md5-file missing.txt
python scripts/generate_pack.py --platform retroarch --list-systems
# Hardware target filtering
python scripts/generate_pack.py --all --target x86_64
python scripts/generate_pack.py --platform retroarch --target switch
# Install manifests (consumed by install.py)
python scripts/generate_pack.py --all --manifest --output-dir install/
python scripts/generate_pack.py --manifest-targets --output-dir install/targets/
```
Packs include platform baseline files plus files required by the platform's cores.
When a file's hash does not match, a better variant is searched for.
If none exists, the platform version is kept and the discrepancy is reported.
- `--split --group-by manufacturer`: group split packs by manufacturer (Sony, Nintendo, Sega...)
- `--from-md5`: look up a hash in the database, or build a custom pack with `--platform`/`--emulator`
- `--from-md5-file`: same, reading hashes from a file (one per line, comments with #)
- `--target`: filter by hardware target (e.g. `switch`, `rpi4`, `x86_64`)
### cross_reference.py
Compare emulator profiles against platform configs.
Reports files that cores need beyond what platforms declare.
```bash
python scripts/cross_reference.py # all
python scripts/cross_reference.py --emulator dolphin # single
python scripts/cross_reference.py --emulator dolphin --json # JSON output
```
### truth.py, generate_truth.py, diff_truth.py
Generate ground truth from emulator profiles, diff against scraped platform data.
```bash
python scripts/generate_truth.py --platform retroarch # single platform truth
python scripts/generate_truth.py --all --output-dir dist/truth/ # all platforms
python scripts/diff_truth.py --platform retroarch # diff truth vs scraped
python scripts/diff_truth.py --all # diff all platforms
```
### export_native.py
Export truth data to native platform formats (System.dat, es_bios.xml, checkBIOS.sh, etc.).
```bash
python scripts/export_native.py --platform batocera
python scripts/export_native.py --all --output-dir dist/upstream/
```
### validation.py
Validation index and ground truth formatting. Used by verify.py for emulator-level checks
(size, CRC32, MD5, SHA1, crypto). Separates reproducible hash checks from cryptographic
validations that require console-specific keys.
### refresh_data_dirs.py
Fetch data directories (Dolphin Sys, PPSSPP assets, blueMSX databases)
```bash
python scripts/refresh_data_dirs.py --key dolphin-sys --force
```
| Script | Purpose |
|--------|---------|
| `common.py` | Shared library: hash computation, file resolution, platform config loading, emulator profiles, target filtering |
| `dedup.py` | Deduplicate `bios/`, move duplicates to `.variants/`. RPG Maker and ScummVM excluded (NODEDUP) |
| `validate_pr.py` | Validate BIOS files in pull requests, post markdown report |
| `auto_fetch.py` | Fetch missing BIOS files from known sources (4-step pipeline) |
| `list_platforms.py` | List active platforms (used by CI) |
| `download.py` | Download packs from GitHub releases (Python, multi-threaded) |
| `generate_readme.py` | Generate README.md and CONTRIBUTING.md from database |
| `generate_site.py` | Generate all MkDocs site pages (this documentation) |
| `deterministic_zip.py` | Rebuild MAME BIOS ZIPs deterministically (same ROMs = same hash) |
| `crypto_verify.py` | 3DS RSA signature and AES crypto verification |
| `sect233r1.py` | Pure Python ECDSA verification on sect233r1 curve (3DS OTP cert) |
| `batch_profile.py` | Batch profiling automation for libretro cores |
| `check_buildbot_system.py` | Detect stale data directories by comparing with buildbot |
| `migrate.py` | Migrate flat bios structure to Manufacturer/Console/ hierarchy |
## Installation tools
Cross-platform BIOS installer for end users:
```bash
# Python installer (auto-detects platform)
python install.py
# Shell one-liner (Linux/macOS)
bash scripts/download.sh retroarch ~/RetroArch/system/
bash scripts/download.sh --list
# Or via install.sh wrapper (detects curl/wget, runs install.py)
bash install.sh
```
`install.py` auto-detects the user's platform by checking config files,
downloads the matching BIOS pack from GitHub releases with SHA1 verification,
and extracts files to the correct directory. `install.ps1` provides
equivalent functionality for Windows/PowerShell.
## Large files
Files over 50 MB are stored as assets on the `large-files` GitHub release.
They are listed in `.gitignore` to keep the git repository lightweight.
`generate_db.py` downloads them from the release when rebuilding the database,
using `fetch_large_file()` from `common.py`. The same function is used by
`generate_pack.py` when a file has a hash mismatch with the local variant.
## Scrapers

Located in `scripts/scraper/`. Each inherits `BaseScraper` and implements `fetch_requirements()`.

| Scraper | Source | Format |
|---------|--------|--------|
| `retrobat_scraper` | batocera-systems.json | JSON |
| `emudeck_scraper` | checkBIOS.sh | Bash + CSV |
| `retrodeck_scraper` | component manifests | JSON per component |
| `romm_scraper` | known_bios_files.json | JSON |
| `coreinfo_scraper` | .info files from libretro-core-info | INI-like |
| `bizhawk_scraper` | FirmwareDatabase.cs | C# source |
| `mame_hash_scraper` | mamedev/mame source tree | C source (sparse clone) |
| `fbneo_hash_scraper` | FBNeo source tree | C source (sparse clone) |
Internal modules: `base_scraper.py` (abstract base with `_fetch_raw()` caching
and shared CLI), `dat_parser.py` (clrmamepro DAT format parser),
`mame_parser.py` (MAME C source BIOS root set parser),
`fbneo_parser.py` (FBNeo C source BIOS set parser),
`_hash_merge.py` (text-based YAML patching that preserves formatting).
Adding a scraper: inherit `BaseScraper`, implement `fetch_requirements()`,
call `scraper_cli(YourScraper)` in `__main__`.
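A minimal scraper following that recipe might look like the sketch below. The `BaseScraper` here is a stand-in so the example is self-contained; the real base class in `scripts/scraper/base_scraper.py` also provides `_fetch_raw()` caching and the shared CLI. The hash value is a dummy placeholder, and the return shape is illustrative.

```python
class BaseScraper:  # stand-in for scripts/scraper/base_scraper.py
    def fetch_requirements(self):
        raise NotImplementedError


class ExampleScraper(BaseScraper):
    """Parse a hypothetical 'filename md5' upstream list into entries."""

    def fetch_requirements(self):
        # In a real scraper this raw text would come from _fetch_raw().
        raw = "scph5501.bin 00000000000000000000000000000000"  # dummy hash
        out = {}
        for line in raw.splitlines():
            name, md5 = line.split()
            out[name] = {"md5": md5}
        return out
```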
## Target scrapers
Located in `scripts/scraper/targets/`. Each inherits `BaseTargetScraper` and implements `fetch_targets()`.
| Scraper | Source | Targets |
|---------|--------|---------|
| `retroarch_targets_scraper` | libretro buildbot nightly | 20+ architectures |
| `batocera_targets_scraper` | Config.in + es_systems.yml | 35+ boards |
| `emudeck_targets_scraper` | EmuScripts GitHub API | steamos, windows |
| `retropie_targets_scraper` | scriptmodules + rp_module_flags | 7 platforms |
```bash
python -m scripts.scraper.targets.retroarch_targets_scraper --dry-run
python -m scripts.scraper.targets.batocera_targets_scraper --dry-run
```
## Exporters
Located in `scripts/exporter/`. Each inherits `BaseExporter` and implements `export()`.
| Exporter | Output format |
|----------|--------------|
| `systemdat_exporter` | clrmamepro DAT (RetroArch System.dat) |
| `batocera_exporter` | Python dict (batocera-systems) |
| `recalbox_exporter` | XML (es_bios.xml) |
| `retrobat_exporter` | JSON (batocera-systems.json) |
| `emudeck_exporter` | Bash script (checkBIOS.sh) |
| `retrodeck_exporter` | JSON (component_manifest.json) |
| `romm_exporter` | JSON (known_bios_files.json) |
| `lakka_exporter` | clrmamepro DAT (delegates to systemdat) |
| `retropie_exporter` | clrmamepro DAT (delegates to systemdat) |

# Troubleshooting - RetroBIOS
Diagnosis guide organized by symptom. Each section describes what to check and how to fix it.
## Game won't start / black screen
Most launch failures are caused by a missing or incorrect BIOS file.
**Check if the BIOS exists:**
```bash
python scripts/verify.py --platform retroarch --verbose
python scripts/verify.py --system sony-playstation
```
Look for `MISSING` entries in the output. A missing required BIOS means the core
cannot start games for that system at all.
**Check if the hash matches:**
Look for `HASH_MISMATCH` in the verify output. This means the file exists but
contains different data than expected. Common causes:
- Wrong region (a PAL BIOS instead of NTSC, or vice versa)
- Wrong hardware revision (e.g. SCPH-5501 vs SCPH-1001 for PlayStation)
- Corrupted download
Each system page on the site lists the expected hashes. Compare your file's
MD5 or SHA1 against those values.
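To hash your local file for that comparison, a few lines of Python are enough. This is a generic sketch (not the repository's `compute_hashes()`), reading in chunks so large files don't load into memory at once:

```python
import hashlib


def file_hashes(path):
    """Compute MD5 and SHA1 of a file, reading it in 1 MB chunks."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}
```

Compare the printed digests against the values on the system page; both must be lowercase hex.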
**Wrong region BIOS:**
Some cores require region-specific BIOS files. A Japanese BIOS won't boot
North American games on cores that enforce region matching. Check the emulator
profile for your core to see which regions are supported and which files
correspond to each.
## BIOS not found by emulator
The file exists on disk, but the emulator reports it as missing.
**Wrong directory:**
Each platform expects BIOS files in a specific base directory:
- RetroArch, Lakka: `system/` inside the RetroArch directory
- Batocera: `/userdata/bios/`
- Recalbox: `/recalbox/share/bios/`
- RetroPie: `~/RetroPie/BIOS/`
Some cores expect files in subdirectories (e.g. `dc/` for Dreamcast, `pcsx2/bios/`
for PlayStation 2). Check the `path:` field in the emulator profile for the exact
expected location relative to the base directory.
**Wrong filename:**
Cores match BIOS files by exact filename. If a core expects `scph5501.bin` and your
file is named `SCPH-5501.BIN`, it won't be found on platforms that do exact name matching.
Check the emulator profile for the expected filename and any aliases listed under
`aliases:`. Aliases are alternative names that the core also accepts.
**Case sensitivity:**
Linux filesystems are case-sensitive. A file named `Bios.ROM` won't match a lookup
for `bios.rom`. Windows and macOS are case-insensitive by default, so the same
file works there but fails on Linux.
Batocera's verification uses `casefold()` for case-insensitive matching, but
the actual emulator may still require exact case. When in doubt, use the exact
filename from the emulator profile.
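The difference between case-insensitive matching and exact matching can be illustrated with a short sketch of a `casefold()`-style lookup (illustrative, not Batocera's actual code):

```python
def find_case_insensitive(expected, names):
    """Return the stored name matching `expected` ignoring case, or None."""
    wanted = expected.casefold()
    for name in names:
        if name.casefold() == wanted:
            return name
    return None
```

Note that this tolerates `SCPH5501.BIN` vs `scph5501.bin`, but a differently spelled name like `SCPH-5501.BIN` still fails: case folding does not fix punctuation differences.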
## Hash mismatch / UNTESTED
`verify.py` reports `HASH_MISMATCH` or `UNTESTED` for a file.
**HASH_MISMATCH:**
The file exists and was hashed, but the computed hash doesn't match any expected
value. This means you have a different version of the file than what the platform
or emulator expects.
To find the correct version, check the system page on the site. It lists every
known BIOS file with its expected MD5 and SHA1.
**UNTESTED:**
On existence-only platforms (RetroArch, Lakka, RetroPie), the file is present
but its hash was not verified against a known value. The platform itself only
checks that the file exists. The `--verbose` flag shows ground truth data from
emulator profiles, which can confirm whether the file's hash is actually correct.
**The .variants/ directory:**
When multiple versions of the same BIOS exist (different revisions, regions, or
dumps), the primary version lives in the main directory and alternatives live in
`.variants/`. `verify.py` checks the primary file first, then falls back to
variants when resolving by hash.
If your file matches a variant hash but not the primary, it's a valid BIOS --
just not the preferred version. Some cores accept multiple versions.
## Pack is missing files
A generated pack doesn't contain all the files you expected.
**Severity levels:**
`verify.py` assigns a severity to each issue. Not all missing files are equally
important:
| Severity | Meaning | Action needed |
|----------|---------|---------------|
| CRITICAL | Required file missing or hash mismatch on MD5 platforms | Must fix. Core won't function. |
| WARNING | Optional file missing, or hash mismatch on existence platforms | Core works but with reduced functionality. |
| INFO | Optional file missing on existence-only platforms, or HLE fallback available | Core works fine, BIOS improves accuracy. |
| OK | File present and verified | No action needed. |
Focus on CRITICAL issues first. WARNING files improve the experience but aren't
strictly necessary. INFO files are nice to have.
**Large files (over 50 MB):**
Files like PS3UPDAT.PUP, PSVUPDAT.PUP, and Switch firmware are too large for the
git repository. They are stored as GitHub release assets under the `large-files`
release and downloaded at build time.
If a pack build fails to include these, check your network connection. In offline
mode (`--offline`), large files are only included if already cached locally in
`.cache/large/`.
**Data directories:**
Some cores need entire directory trees rather than individual files (e.g. Dolphin's
`Sys/` directory, PPSSPP's `assets/`). These are fetched by `refresh_data_dirs.py`
from upstream repositories.
In offline mode, data directories are only included if already cached in `data/`.
Run `python scripts/refresh_data_dirs.py` to fetch them.
## verify.py reports errors
How to read and interpret `verify.py` output.
**Status codes:**
| Status | Meaning |
|--------|---------|
| `ok` | File present, hash matches (or existence check passed) |
| `untested` | File present, hash not confirmed (existence-only platforms) |
| `missing` | File not found in the repository |
| `hash_mismatch` | File found but hash doesn't match expected value |
| `size_mismatch` | File found but size doesn't match what the emulator expects |
**Reading the output:**
Each line shows the file path, its status, and severity. In verbose mode, ground
truth data from emulator profiles is appended, showing which cores reference the
file and what validations they perform.
```
scph5501.bin ok [OK]
dc_boot.bin missing [CRITICAL]
gba_bios.bin untested [WARNING]
```
**Cross-reference section:**
After per-file results, `verify.py` prints a cross-reference report. This lists
files that emulator cores need but that the platform YAML doesn't declare. These
files are still included in packs automatically, but the report helps identify
gaps in platform coverage data.
The cross-reference uses `resolve_platform_cores()` to determine which emulator
profiles are relevant for each platform, then checks whether each profile's files
appear in the platform config.
**Filtering output:**
```bash
# By platform
python scripts/verify.py --platform batocera
# By emulator core
python scripts/verify.py --emulator beetle_psx
# By system
python scripts/verify.py --system sony-playstation
# By hardware target
python scripts/verify.py --platform retroarch --target switch
# JSON for scripted processing
python scripts/verify.py --platform retroarch --json
```
## Installation script fails
Problems with `install.py`, `install.sh`, or `download.sh`.
**Network issues:**
The installer downloads packs from GitHub releases. If the download fails:
- Check your internet connection
- Verify that `https://github.com` is reachable
- If behind a proxy, set `HTTPS_PROXY` in your environment
- Try again later if GitHub is experiencing issues
**Permission denied:**
The installer needs write access to the target directory.
- On Linux/macOS: check directory ownership (`ls -la`) and run with appropriate
permissions. Avoid running as root unless the target directory requires it.
- On Windows: run PowerShell as Administrator if installing to a protected directory.
**Platform not detected:**
`install.py` auto-detects your platform by checking for known config files. If
detection fails, specify the platform manually:
```bash
python install.py --platform retroarch --dest ~/RetroArch/system/
python install.py --platform batocera --dest /userdata/bios/
```
Use `python install.py --help` to see all available platforms and options.
**Pack not found in release:**
If the installer reports that no pack exists for your platform, check available
releases:
```bash
python scripts/download.py --list
# or
bash scripts/download.sh --list
```
Some platforms share packs (Lakka uses the RetroArch pack). The installer handles
this mapping automatically, but if you're downloading manually, check which pack
name corresponds to your platform.

# Verification Modes
Each platform verifies BIOS files differently. `verify.py` replicates the native behavior
of each platform so that verification results match what the platform itself would report.
## Existence Mode
**Platforms**: RetroArch, Lakka, RetroPie
**Source**: RetroArch `core_info.c`, function `path_is_valid()`
The most straightforward mode. A file is OK if it exists at the expected path. No hash is checked.
Any file with the correct name passes, regardless of content.
| Condition | Status | Severity (required) | Severity (optional) |
|-----------|--------|---------------------|---------------------|
| File present | OK | OK | OK |
| File missing | MISSING | WARNING | INFO |
RetroArch does not distinguish between a correct and an incorrect BIOS at the verification
level. A corrupt or wrong-region file still shows as present. This is by design in the
upstream code: `core_info.c` only calls `path_is_valid()` and does not open or hash the file.
Lakka and RetroPie inherit this behavior through platform config inheritance
(`inherits: retroarch` in the platform YAML).
## MD5 Mode
**Platforms**: Batocera, RetroBat, Recalbox, EmuDeck, RetroDECK, RomM
All MD5-mode platforms compute a hash of the file and compare it against an expected value.
The details vary by platform.
### Standard MD5 (Batocera, RetroBat)
`verify.py` replicates Batocera's `md5sum()` function. The file is read in binary mode,
hashed with MD5, and compared case-insensitively against the expected value.
| Condition | Status | Severity (required) | Severity (optional) |
|-----------|--------|---------------------|---------------------|
| Hash matches | OK | OK | OK |
| File present, hash differs | UNTESTED | WARNING | WARNING |
| File missing | MISSING | CRITICAL | WARNING |
If the `resolve_local_file` step already confirmed the MD5 match (status `md5_exact`),
`verify.py` skips re-hashing and returns OK directly.
### Truncated MD5 (Batocera bug)
Some entries in Batocera's system data contain 29-character MD5 strings instead of
the standard 32. This is a known upstream bug. `verify.py` handles it by prefix matching:
if the expected hash is shorter than 32 characters, the actual hash is compared against
only its first N characters.
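The prefix-matching workaround can be sketched as follows (illustrative, not the exact code in `verify.py`):

```python
def md5_matches(expected, actual):
    """Compare MD5s case-insensitively; a truncated expected hash
    (fewer than 32 hex chars, the Batocera bug above) matches as a prefix."""
    expected, actual = expected.lower(), actual.lower()
    if len(expected) < 32:
        return actual.startswith(expected)
    return actual == expected
```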
### md5_composite (Recalbox ZIP verification)
Recalbox computes `Zip::Md5Composite` for ZIP files: the MD5 of the concatenation of all
inner file MD5s (sorted by filename). `verify.py` replicates this with `md5_composite()`
from `common.py`. When a ZIP file's direct MD5 does not match, the composite is tried
before reporting a mismatch.
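A sketch of the composite computation, under the assumption that the inner hex digests are concatenated in filename order with no separator (the exact byte layout is defined by Recalbox's `Zip::Md5Composite` and replicated in `common.py`):

```python
import hashlib
import zipfile


def md5_composite(zip_path):
    """MD5 over the concatenated MD5 hex digests of all inner files,
    sorted by filename (directory entries skipped)."""
    with zipfile.ZipFile(zip_path) as zf:
        inner = sorted(n for n in zf.namelist() if not n.endswith("/"))
        parts = "".join(hashlib.md5(zf.read(name)).hexdigest() for name in inner)
    return hashlib.md5(parts.encode("ascii")).hexdigest()
```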
### Multi-hash (Recalbox)
Recalbox allows comma-separated MD5 values for a single file entry, accepting any one
of them as valid. `verify.py` splits on commas and tries each hash. A match against any
listed hash is OK.
### Mandatory levels (Recalbox)
Recalbox uses three severity levels derived from two YAML fields (`mandatory` and
`hashMatchMandatory`):
| mandatory | hashMatchMandatory | Color | verify.py mapping |
|-----------|--------------------|--------|-------------------|
| true | true | RED | CRITICAL |
| true | false | YELLOW | WARNING |
| false | (any) | GREEN | INFO |
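The mapping in the table reduces to a small decision function (sketch; names are illustrative):

```python
def recalbox_severity(mandatory, hash_match_mandatory):
    """Map Recalbox's two YAML flags to a verify.py severity."""
    if not mandatory:
        return "INFO"      # GREEN: hashMatchMandatory is ignored
    if hash_match_mandatory:
        return "CRITICAL"  # RED
    return "WARNING"       # YELLOW
```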
### checkInsideZip (Batocera zippedFile)
When a platform entry has a `zipped_file` field, the expected MD5 is not the hash of the
ZIP container but of a specific ROM file inside the ZIP. `verify.py` replicates Batocera's
`checkInsideZip()`:
1. Open the ZIP.
2. Find the inner file by name (case-insensitive via `casefold()`).
3. Read its contents and compute MD5.
4. Compare against the expected hash.
If the inner file is not found inside the ZIP, the status is UNTESTED with a reason string.
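The four steps above can be sketched like this (a simplified stand-in for the replicated `checkInsideZip()`; the real code also carries a reason string):

```python
import hashlib
import zipfile


def check_inside_zip(zip_path, inner_name, expected_md5):
    """Hash a specific ROM inside a ZIP, matching the inner name
    case-insensitively. Returns (status, actual_md5_or_None)."""
    wanted = inner_name.casefold()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.casefold() == wanted:
                actual = hashlib.md5(zf.read(name)).hexdigest()
                status = "ok" if actual == expected_md5.lower() else "untested"
                return (status, actual)
    return ("untested", None)  # inner file not found in the ZIP
```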
### RomM verification
RomM checks both file size and hash. It accepts any hash type (MD5, SHA1, or CRC32).
ZIP files are not opened; only the container is checked. `verify.py` replicates this
by checking size first, then trying each available hash.
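The "size first, then any hash" rule can be sketched for an in-memory file (illustrative; the real check works on paths and the database index):

```python
import hashlib
import zlib


def romm_check(data, expected_size, known_hashes):
    """Size must match exactly; then any one of MD5, SHA1, or CRC32
    appearing in `known_hashes` (case-insensitive hex) is enough."""
    if len(data) != expected_size:
        return False
    candidates = {
        hashlib.md5(data).hexdigest(),
        hashlib.sha1(data).hexdigest(),
        format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
    }
    return bool(candidates & {h.lower() for h in known_hashes})
```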
## SHA1 Mode
**Platforms**: BizHawk
BizHawk firmware entries use SHA1 as the primary hash. `verify.py` computes SHA1
via `compute_hashes()` and compares case-insensitively.
| Condition | Status | Severity (required) | Severity (optional) |
|-----------|--------|---------------------|---------------------|
| SHA1 matches | OK | OK | OK |
| File present, SHA1 differs | UNTESTED | WARNING | WARNING |
| File missing | MISSING | CRITICAL | WARNING |
## Emulator-Level Validation
Independent of platform verification mode, `verify.py` runs emulator-level validation
from `validation.py`. This layer uses data from emulator profiles (YAML files in
`emulators/`), which are source-verified against emulator code.
### Validation index
`_build_validation_index()` reads all emulator profiles and builds a per-filename
index of validation rules. When multiple emulators reference the same file, checks
are merged (union of all check types). Conflicting expected values are kept as sets
(e.g., multiple accepted CRC32 values for different ROM versions).
Each entry in the index tracks:
- `checks`: list of validation types (e.g., `["size", "crc32"]`)
- `sizes`: set of accepted exact sizes
- `min_size`, `max_size`: bounds when the code accepts a range
- `crc32`, `md5`, `sha1`, `sha256`: sets of accepted hash values
- `adler32`: set of accepted Adler-32 values
- `crypto_only`: non-reproducible checks (see below)
- `per_emulator`: per-core detail with source references
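The merge behavior (union of check types, sets of accepted values) can be sketched as follows; field names follow the description above, and the hash values in the test are dummies:

```python
def merge_validation(index, filename, profile_entry):
    """Merge one emulator profile's rules into a per-filename index.
    Checks are unioned; conflicting expected values accumulate as sets."""
    entry = index.setdefault(
        filename, {"checks": set(), "crc32": set(), "sizes": set()}
    )
    entry["checks"] |= set(profile_entry.get("validation", []))
    if "crc32" in profile_entry:
        entry["crc32"].add(profile_entry["crc32"])
    if "size" in profile_entry:
        entry["sizes"].add(profile_entry["size"])
```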
### Check categories
Validation checks fall into two categories:
**Reproducible** (`_HASH_CHECKS`): `crc32`, `md5`, `sha1`, `adler32`. These can be
computed from the file alone. `verify.py` calculates hashes and compares against
accepted values from the index.
**Non-reproducible** (`_CRYPTO_CHECKS`): `signature`, `crypto`. These require
console-specific cryptographic keys (e.g., RSA-2048 for 3DS, AES-128-CBC for certain
firmware). `verify.py` reports these as informational but cannot verify them without
the keys. Size checks still apply if combined with crypto.
### Size validation
Three forms:
- **Exact size**: `size: 524288` with `validation: [size]`. File must be exactly this many bytes.
- **Range**: `min_size: 40`, `max_size: 131076` with `validation: [size]`. File size must fall within bounds.
- **Informational**: `size: 524288` without `validation: [size]`. The size is documented but the emulator does not check it at runtime.
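The three forms can be sketched as one check function (illustrative names; the real logic lives in `validation.py`):

```python
def check_size(actual, rule):
    """Apply the size rules above: informational, exact set, or range."""
    if "size" not in rule.get("validation", []):
        return "info"  # size documented but not enforced at runtime
    if "sizes" in rule:
        return "ok" if actual in rule["sizes"] else "size_mismatch"
    lo = rule.get("min_size", 0)
    hi = rule.get("max_size", float("inf"))
    return "ok" if lo <= actual <= hi else "size_mismatch"
```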
### Complement to platform checks
Emulator validation runs after platform verification. When a file passes platform checks
(e.g., existence-mode OK) but fails emulator validation (e.g., wrong CRC32), the result
includes a `discrepancy` field:
```
file present (OK) but handy says size mismatch: got 256, accepted [512]
```
This catches cases where a file has the right name but wrong content, which existence-mode
platforms cannot detect.
## Severity Matrix
`compute_severity()` maps the combination of status, required flag, verification mode,
and HLE fallback to a severity level.
| Mode | Status | required | hle_fallback | Severity |
|------|--------|----------|--------------|----------|
| any | OK | any | any | OK |
| any | MISSING | any | true | INFO |
| existence | MISSING | true | false | WARNING |
| existence | MISSING | false | false | INFO |
| md5/sha1 | MISSING | true | false | CRITICAL |
| md5/sha1 | MISSING | false | false | WARNING |
| md5/sha1 | UNTESTED | any | false | WARNING |
**HLE fallback**: when an emulator profile marks a file with `hle_fallback: true`, the
core has a built-in high-level emulation path and functions without the file. Missing
files are downgraded to INFO regardless of platform mode or required status. The file
is still included in packs (better accuracy with the real BIOS), but its absence is not
actionable.
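Read as code, the matrix above becomes a short decision function. This is a sketch of the mapping, not the actual `compute_severity()` signature:

```python
def compute_severity(mode, status, required, hle_fallback):
    """Map (mode, status, required, hle_fallback) to a severity,
    following the matrix above."""
    if status == "OK":
        return "OK"
    if status == "MISSING" and hle_fallback:
        return "INFO"  # core has a built-in HLE path
    if mode == "existence":
        return "WARNING" if required else "INFO"
    if status == "MISSING":
        return "CRITICAL" if required else "WARNING"
    return "WARNING"  # UNTESTED on md5/sha1 platforms
```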
## File Resolution Chain
Before verification, each file entry is resolved to a local path by `resolve_local_file()`.
The function tries these steps in order, returning the first match:
| Step | Method | Returns | When it applies |
|------|--------|---------|-----------------|
| 0 | Path suffix exact | `exact` | `dest_hint` matches `by_path_suffix` index (regional variants with same filename, e.g., `GC/USA/IPL.bin` vs `GC/EUR/IPL.bin`) |
| 1 | SHA1 exact | `exact` | SHA1 present in the file entry and found in database |
| 2 | MD5 direct lookup | `md5_exact` | MD5 present, not a `zipped_file` entry, name matches (prevents cross-contamination from unrelated files sharing an MD5) |
| 3 | Name/alias existence | `exact` | No MD5 in entry; any file with matching name or alias exists. Prefers primary over `.variants/` |
| 4 | Name + md5_composite/MD5 | `exact` or `hash_mismatch` | Name matches, checks md5_composite for ZIPs and direct MD5 per candidate. Falls back to hash_mismatch if name matches but no hash does |
| 5 | ZIP contents index | `zip_exact` | `zipped_file` with MD5; searches inner ROM MD5 across all ZIPs when name-based resolution failed |
| 6 | MAME clone fallback | `mame_clone` | File was deduped; resolves via canonical set name (up to 3 levels deep) |
| 7 | Data directory scan | `data_dir` | Searches `data/` caches by exact path then case-insensitive basename walk |
| 8 | Agnostic fallback | `agnostic_fallback` | File entry marked `agnostic: true`; matches any file under the system path prefix within the size constraints |
If no step matches, the result is `(None, "not_found")`.
The `hash_mismatch` status at step 4 means a file with the right name exists but its hash
does not match. This still resolves to a local path (the file is present), but verification
will report it as UNTESTED with a reason string showing the expected vs actual hash prefix.
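Structurally, the chain is a first-match-wins sequence of resolver steps. The sketch below uses toy resolvers standing in for the real steps (SHA1 lookup, name match, and so on):

```python
def resolve_local_file(entry, resolvers):
    """Try each resolution step in order; return the first (path, method)
    hit, else (None, 'not_found')."""
    for step in resolvers:
        hit = step(entry)
        if hit is not None:
            return hit
    return (None, "not_found")


# Toy steps standing in for the real chain:
def by_sha1(entry):
    if entry.get("sha1") == "abc":
        return ("bios/Sony/PlayStation/scph5501.bin", "exact")
    return None


def by_name(entry):
    if entry.get("name"):
        return ("bios/Sony/PlayStation/.variants/scph5501.bin", "exact")
    return None
```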
## Discrepancy Detection
When platform verification passes but emulator validation fails, the file has a discrepancy.
This happens most often in existence-mode platforms where any file with the right name is
accepted.
### Variant search
`_find_best_variant()` searches for an alternative file in the repository that satisfies
both the platform MD5 requirement and emulator validation:
1. Look up all files with the same name in the `by_name` index.
2. Skip the current file (already known to fail validation).
3. For each candidate, check that its MD5 matches the platform expectation.
4. Run `check_file_validation()` against the candidate.
5. Return the first candidate that passes both checks.
The search covers files in `.variants/` (alternate hashes stored during deduplication).
If a better variant is found, the pack uses it instead of the primary file. If no variant
satisfies both constraints, the platform version is kept and the discrepancy is reported
in the verification output.
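The five steps reduce to a filtered first-match search. A minimal sketch, with illustrative inputs (`candidates` as `(path, md5)` pairs and validation as a callable):

```python
def find_best_variant(candidates, platform_md5, passes_validation):
    """Return the first candidate path that matches the platform MD5
    AND passes emulator validation, else None."""
    for path, md5 in candidates:
        if md5.lower() != platform_md5.lower():
            continue  # fails the platform expectation
        if passes_validation(path):
            return path  # satisfies both constraints
    return None
```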
### Practical example
A `scph5501.bin` file passes Batocera MD5 verification (hash matches upstream declaration)
but fails the emulator profile's size check because the profile was verified against a
different revision. `_find_best_variant` scans `.variants/scph5501.bin.*` for a file
that matches both the Batocera MD5 and the emulator's size expectation. If found, the
variant is used in the pack. If not, the Batocera-verified file is kept and the discrepancy
is logged.