# Adding a scraper

How to create or modify a scraper for fetching BIOS requirements from upstream platform sources.

## Scraper architecture

### Plugin discovery

Scrapers are discovered automatically at import time. The `scripts/scraper/__init__.py` module uses `pkgutil.iter_modules` to scan for files matching `*_scraper.py` in the scraper directory. Each module must export:

- `PLATFORM_NAME: str` - the platform identifier (matches `_registry.yml`)
- `Scraper: class` - a subclass of `BaseScraper`

No registration code is needed. Drop a file, export the two names, and it works.

```python
# scripts/scraper/__init__.py (simplified)
for finder, name, ispkg in pkgutil.iter_modules([package_dir]):
    if not name.endswith("_scraper"):
        continue
    module = importlib.import_module(f".{name}", package=__package__)
    # looks for PLATFORM_NAME and Scraper attributes
```

### BaseScraper ABC

`BaseScraper` (`scripts/scraper/base_scraper.py`) provides the foundation:

```
BaseScraper
    __init__(url: str)
    _fetch_raw() -> str                       # HTTP GET, cached, 50 MB limit
    fetch_requirements() -> list              # abstract: parse upstream data
    validate_format(raw_data) -> bool         # abstract: detect format changes
    compare_with_config(config) -> ChangeSet  # diff against existing YAML
    test_connection() -> bool                 # reachability check
```

`_fetch_raw()` handles HTTP with `urllib.request`, sets a `retrobios-scraper/1.0` user-agent, enforces a 30-second timeout, and reads the response in 64 KB chunks with a 50 MB hard limit to prevent memory exhaustion. The result is cached on the instance after the first call.
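The chunked-read-with-hard-limit pattern behind `_fetch_raw()` can be sketched against any file-like stream. This is an illustrative helper, not the actual implementation; the name `read_limited` and its defaults are assumptions:

```python
import io

MAX_BYTES = 50 * 1024 * 1024   # 50 MB hard limit
CHUNK_SIZE = 64 * 1024         # 64 KB per read

def read_limited(stream, max_bytes: int = MAX_BYTES, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read a file-like object in chunks, aborting once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        total += len(chunk)
        if total > max_bytes:
            # Fail fast instead of buffering an unbounded response in memory
            raise ValueError(f"Response exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)

# Works the same way on an HTTP response object or an in-memory stream:
data = read_limited(io.BytesIO(b"x" * 1000), max_bytes=2000)  # returns 1000 bytes
```

Reading in fixed-size chunks keeps peak memory proportional to the limit rather than to whatever the server decides to send.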
### BiosRequirement

A dataclass representing a single BIOS file entry:

```python
@dataclass
class BiosRequirement:
    name: str                       # filename
    system: str                     # retrobios system ID
    sha1: str | None = None
    md5: str | None = None
    crc32: str | None = None
    size: int | None = None
    destination: str = ""           # relative path in BIOS dir
    required: bool = True
    zipped_file: str | None = None  # ROM name inside a ZIP
    native_id: str | None = None    # original system name
```

### ChangeSet

Returned by `compare_with_config()`. Contains:

- `added: list[BiosRequirement]` - new files not in the existing config
- `removed: list[BiosRequirement]` - files present in config but gone upstream
- `modified: list[tuple[BiosRequirement, BiosRequirement]]` - hash changes
- `has_changes: bool` - True if any of the above are non-empty
- `summary() -> str` - human-readable summary (e.g. `+3 added, ~1 modified`)

### scraper_cli

`scraper_cli(scraper_class, description)` provides a shared CLI with three modes:

| Flag | Behavior |
|------|----------|
| `--dry-run` | Fetch and print a summary grouped by system |
| `--json` | Output all requirements as JSON |
| `--output FILE` | Write platform YAML to FILE |

When `--output` targets an existing file, the CLI preserves keys not generated by the scraper (e.g. `data_directories`, manual additions). Only the `systems` section is replaced. If the scraper defines `generate_platform_yaml()`, that method is used instead of the generic YAML builder.

### Helper functions

Two additional functions in `base_scraper.py`:

- `fetch_github_latest_version(repo)` - fetches the latest release tag via the GitHub API.
- `fetch_github_latest_tag(repo, prefix)` - fetches the most recent tag matching an optional prefix.
## Creating a BIOS scraper

### Minimal example

Based on the RomM scraper pattern (JSON source, flat structure):

```python
"""Scraper for MyPlatform BIOS requirements."""
from __future__ import annotations

import json
import sys

try:
    from .base_scraper import BaseScraper, BiosRequirement
except ImportError:
    from base_scraper import BaseScraper, BiosRequirement

PLATFORM_NAME = "myplatform"
SOURCE_URL = "https://raw.githubusercontent.com/org/repo/main/bios_list.json"

SLUG_MAP: dict[str, str] = {
    "psx": "sony-playstation",
    "saturn": "sega-saturn",
}


class Scraper(BaseScraper):
    def __init__(self, url: str = SOURCE_URL):
        super().__init__(url=url)

    def fetch_requirements(self) -> list[BiosRequirement]:
        raw = self._fetch_raw()
        if not self.validate_format(raw):
            raise ValueError("Format validation failed")
        data = json.loads(raw)
        requirements = []
        for entry in data:
            system = SLUG_MAP.get(entry["platform"])
            if not system:
                print(f"Warning: unmapped '{entry['platform']}'", file=sys.stderr)
                continue
            requirements.append(BiosRequirement(
                name=entry["filename"],
                system=system,
                md5=entry.get("md5"),
                sha1=entry.get("sha1"),
                size=entry.get("size"),
                destination=entry["filename"],
                required=entry.get("required", True),
            ))
        return requirements

    def validate_format(self, raw_data: str) -> bool:
        try:
            data = json.loads(raw_data)
        except (json.JSONDecodeError, TypeError):
            return False
        return isinstance(data, list) and len(data) > 0


def main():
    from scripts.scraper.base_scraper import scraper_cli
    scraper_cli(Scraper, "Scrape MyPlatform BIOS requirements")


if __name__ == "__main__":
    main()
```

### Parsing different upstream formats

Each platform stores its BIOS requirements differently. The scraper's job is to normalize them into `BiosRequirement` entries.
| Format | Example | Parsing approach |
|--------|---------|------------------|
| JSON | RomM `known_bios_files.json` | `json.loads()`, iterate keys |
| XML | Recalbox `es_bios.xml` | `xml.etree.ElementTree`, xpath or iter |
| clrmamepro DAT | RetroArch `System.dat` | Use `dat_parser` module (see below) |
| Python dict | Batocera `batocera-systems` | `ast.literal_eval` or regex extraction |
| Bash script | EmuDeck `checkBIOS.sh` | Line-by-line regex parsing |
| C# source | BizHawk `FirmwareDatabase.cs` | Regex for method calls and string literals |
| C source | MAME/FBNeo drivers | Use `mame_parser` or `fbneo_parser` (see below) |
| JSON (GitHub API) | RetroDECK component manifests | `json.loads()` per manifest file |

### System ID mapping

Every scraper maintains a `SLUG_MAP` (or equivalent) that translates the platform's native system identifiers to retrobios system IDs. The canonical retrobios format is `manufacturer-console` in lowercase with hyphens (e.g. `sony-playstation`, `sega-mega-drive`, `nintendo-gba`).

When a native slug has no mapping, print a warning to stderr. This surfaces new systems added upstream that need to be mapped.

System ID consistency matters for cross-platform operations: the same console must use the same ID across all scrapers and platforms. Before inventing a new ID, check existing profiles and platform YAMLs for precedent. The `SYSTEM_ALIASES` dict in `common.py` maps common variations to canonical IDs.

### Hash normalization

- Normalize all hashes to lowercase hex strings.
- Handle missing hashes gracefully (set to `None`, not empty string).
- Some platforms provide multiple hash types per entry. Populate whichever fields are available.
- Batocera uses 29-character truncated MD5 hashes in some entries. The resolution layer handles prefix matching, but the scraper should store the hash as-is.
Scraped data reflects what the upstream declares, which may not match reality. The scraper's job is faithful transcription of upstream data, not correction. Corrections happen in the emulator profiles (source-verified) and in `_shared.yml` (curated). If a scraper detects an obviously wrong hash or filename, log a warning but still include the upstream value. The divergence will surface during truth diffing.

## Creating a target scraper

Target scrapers determine which emulator cores are available on each hardware target. They live in `scripts/scraper/targets/` and follow the same auto-discovery pattern (`*_targets_scraper.py`).

### BaseTargetScraper ABC

```
BaseTargetScraper
    __init__(url: str)
    fetch_targets() -> dict      # abstract: return target data
    write_output(data, path)     # write YAML to disk
```

### Output format

`fetch_targets()` must return a dict with this structure:

```python
{
    "platform": "myplatform",
    "source": "https://...",
    "scraped_at": "2026-03-30T12:00:00Z",
    "targets": {
        "x86_64": {
            "architecture": "x86_64",
            "cores": ["beetle_psx", "dolphin", "snes9x"],
        },
        "rpi4": {
            "architecture": "aarch64",
            "cores": ["pcsx_rearmed", "mgba"],
        },
    },
}
```

The `targets` dict maps target names to their available cores. Core names must match the names used in emulator profile `cores:` fields for the target filtering pipeline to work correctly.

### Overrides

`platforms/targets/_overrides.yml` provides post-scrape adjustments:

- **aliases**: map alternate target names to canonical ones (e.g. `arm64` -> `aarch64`)
- **add_cores**: cores present on a target but not detected by the scraper
- **remove_cores**: cores detected by the scraper but not actually functional

Overrides are applied by `load_target_config()` in `common.py` after loading the scraped data. The scraper itself does not need to handle overrides.
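How the three override kinds combine can be sketched as a pure function over the scraped `targets` dict. This is an illustrative sketch only; the real logic lives in `load_target_config()` in `common.py` and may differ in structure and edge-case handling:

```python
def apply_overrides(targets: dict, overrides: dict) -> dict:
    """Hypothetical override pass: rename via aliases, then add/remove cores."""
    aliases = overrides.get("aliases", {})
    result = {}
    for name, data in targets.items():
        canonical = aliases.get(name, name)          # arm64 -> aarch64
        cores = set(data["cores"])
        cores |= set(overrides.get("add_cores", {}).get(canonical, []))
        cores -= set(overrides.get("remove_cores", {}).get(canonical, []))
        result[canonical] = {**data, "cores": sorted(cores)}
    return result

targets = {"arm64": {"architecture": "aarch64", "cores": ["mgba", "broken_core"]}}
overrides = {
    "aliases": {"arm64": "aarch64"},
    "add_cores": {"aarch64": ["pcsx_rearmed"]},
    "remove_cores": {"aarch64": ["broken_core"]},
}
print(apply_overrides(targets, overrides))
# {'aarch64': {'architecture': 'aarch64', 'cores': ['mgba', 'pcsx_rearmed']}}
```

Note that aliasing runs first in this sketch, so `add_cores`/`remove_cores` entries are looked up under the canonical name.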
### Module contract

```python
from scripts.scraper.targets import BaseTargetScraper

PLATFORM_NAME = "myplatform_targets"


class Scraper(BaseTargetScraper):
    def __init__(self):
        super().__init__(url="https://...")

    def fetch_targets(self) -> dict:
        # Fetch and parse target data
        ...
```

Register the target scraper in `_registry.yml`:

```yaml
myplatform:
  target_scraper: myplatform_targets
  target_source: https://...
```

### Existing target scrapers

| Scraper | Source | Approach |
|---------|--------|----------|
| `retroarch_targets` | libretro buildbot nightly | Scrape directory listings for each target arch |
| `batocera_targets` | Config.in + es_systems.yml | Cross-reference kernel config with system definitions |
| `emudeck_targets` | EmuScripts + RetroArch cores | GitHub API for script availability per OS |
| `retropie_targets` | scriptmodules + rp_module_flags | Parse Bash scriptmodules for platform flags |

## Parser modules

Shared parsers in `scripts/scraper/` handle formats used by multiple scrapers, or formats complex enough to warrant dedicated parsing logic.

### dat_parser

Parses the clrmamepro DAT format as used in RetroArch's `System.dat`:

```
game (
    name "System"
    comment "Platform Name"
    rom ( name filename size 12345 crc ABCD1234 md5 ... sha1 ... )
)
```

Produces `DatRom` dataclass instances with `name`, `size`, `crc32`, `md5`, `sha1`, and `system` fields. The `libretro_scraper` uses this parser.

### mame_parser

Parses MAME C source files to extract BIOS root sets. Handles:

- Machine declaration macros: `GAME`, `SYST`, `COMP`, `CONS`
- `MACHINE_IS_BIOS_ROOT` flag detection
- `ROM_START`/`ROM_END` blocks
- `ROM_LOAD` variants and `ROM_REGION` declarations
- `ROM_SYSTEM_BIOS` entries
- `NO_DUMP` filtering and `BAD_DUMP` flagging

Used by `mame_hash_scraper` to auto-fetch BIOS hashes from MAME driver sources for each tagged MAME version.
### fbneo_parser

Parses FBNeo C source files:

- `BurnRomInfo` structs (static ROM arrays with name, size, CRC)
- `BurnDriver` structs (driver registration with `BDF_BOARDROM` flag)
- BIOS set identification via the boardrom flag

Used by `fbneo_hash_scraper` to extract BIOS ROM definitions.

### _hash_merge

Text-based YAML patching that merges fetched hash data into emulator profiles while preserving formatting. Two strategies:

- **MAME**: updates `bios_zip` entries with `contents` lists (name, size, CRC32)
- **FBNeo**: updates individual ROM entries grouped by `archive` field

The merge preserves fields the hash data does not generate (system, note, required) and leaves entries not present in the hash data untouched. It uses text-level YAML manipulation rather than load-dump to maintain human-readable formatting.

## Testing

### Development workflow

1. **Start with `--dry-run`** to preview before writing output:

   ```bash
   python -m scripts.scraper.myplatform_scraper --dry-run
   ```

2. **Check JSON output** for data quality:

   ```bash
   python -m scripts.scraper.myplatform_scraper --json | python -m json.tool | head -50
   ```

3. **Compare with existing YAML** if updating a scraper:

   ```bash
   python -m scripts.scraper.myplatform_scraper --output /tmp/test.yml
   diff platforms/myplatform.yml /tmp/test.yml
   ```

4. **Run verification** after generating:

   ```bash
   python scripts/verify.py --platform myplatform
   python scripts/verify.py --platform myplatform --verbose
   ```

5. **Run the full pipeline** before committing:

   ```bash
   python scripts/pipeline.py --offline
   ```

### Round-trip testing

If an exporter exists for the platform, validate the scrape-export-compare cycle:

```bash
# Scrape upstream -> platform YAML
python -m scripts.scraper.myplatform_scraper --output /tmp/scraped.yml

# Export truth data -> native format
python scripts/export_native.py --platform myplatform --output /tmp/exported.json

# Compare
diff <(python -m scripts.scraper.myplatform_scraper --json | python -m json.tool) \
     /tmp/exported.json
```

### Common issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Unmapped slug warnings | New system added upstream | Add mapping to `SLUG_MAP` |
| Empty requirements list | Upstream format changed | Check `validate_format()`, update parser |
| Hash mismatch in verify | Upstream updated hashes | Re-scrape and regenerate platform YAML |
| Scraper hangs | URL unreachable, no timeout | `_fetch_raw()` has 30s timeout; check URL |
| `Response exceeds 50 MB` | Upstream file grew | Investigate; may need chunked parsing |
| `validate_format` fails | Upstream restructured | Update both `validate_format` and `fetch_requirements` |

### E2E tests

The project's test suite (`tests/test_e2e.py`) covers scraper integration at the pipeline level. When adding a new scraper, verify that the full pipeline passes:

```bash
python -m unittest tests.test_e2e
python scripts/pipeline.py --offline
```

Both must pass before the scraper is considered complete.
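Alongside the pipeline-level e2e suite, a focused unit test for a scraper's `validate_format()` catches upstream format drift early. This is a hypothetical test module; the `validate_format` body here mirrors the minimal example earlier in this page rather than importing any real scraper:

```python
import json
import unittest

def validate_format(raw_data: str) -> bool:
    """Stand-in for Scraper.validate_format from the minimal example."""
    try:
        data = json.loads(raw_data)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, list) and len(data) > 0

class ValidateFormatTest(unittest.TestCase):
    def test_accepts_nonempty_list(self):
        self.assertTrue(validate_format('[{"platform": "psx"}]'))

    def test_rejects_wrong_shapes(self):
        self.assertFalse(validate_format('{}'))        # dict, not list
        self.assertFalse(validate_format('[]'))        # empty list
        self.assertFalse(validate_format('not json'))  # unparseable
```

Run it with `python -m unittest` pointed at the test module, the same way the e2e suite is invoked above.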