# Historic public data expansion plan

Planning note for stacking additional official NHS public aggregate datasets, filtered to organisation code RDY, to support genuine trend analysis in agent-assisted public reports.

> Public aggregate data only. Not official Dorset HealthCare reporting. Human review required.

---

## Source status

### Already usable (no historic download needed)

| Source | Existing data | Report |
|--------|---------------|--------|
| MHSDS time series | 11 Provider months (MHS01, MHS29, MHS69) | `public-mh-access-profile.html` |
| Talking Therapies time series | 13 months (M001, M031, M053) | `public-talking-therapies-profile.html` |
| DSPT history | Multi-row assessment table | `public-assurance-profile.html` |

### Targets for historic expansion

| Source | Current gap | Target output |
|--------|-------------|---------------|
| CSDS | Single month (March 2026) | `processed/trend_csds_activity_rdy.csv` |
| A&E | Single month | `processed/trend_ae_rdy.csv` |
| DM01 | Single month | `processed/trend_dm01_rdy.csv` |
| KH03 | Mixed 2007–2024 snapshots in one file | `processed/trend_kh03_beds_rdy.csv`, `processed/latest_kh03_beds_rdy.csv` |
| FFT | No org-level RDY rows in summary XLSX | `processed/trend_fft_rdy.csv` or manual doc |
| MHSDS MHS23 | Not in time-series files | `processed/trend_mhs23_rdy.csv` or metadata note |

### Not expanded in this run

- NOF (quarterly only; not prioritised)
- Talking Therapies (13-month time series already sufficient)
- KO41a / ERIC (annual)
- CQC (context only)

---

## Risks

- **File size**: CSDS, DM01, MHSDS monthly ZIPs are large; downloads may be slow or fail mid-run
- **Scrape breakage**: Changes to NHS Digital / NHS England page structure can break link discovery
- **Definition drift**: Measure IDs, column names or breakdowns may change between publication months
- **KH03 confusion**: Raw NHS “latest quarter” CSV may contain many historical snapshot dates in one file
- **FFT availability**: Public summary files may not include trust-level RDY rows
- **Provisional data**: Monthly statistics are revised at the final refresh; suppression and rounding apply
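
The KH03 risk above can be illustrated with a minimal base R sketch that keeps only the most recent snapshot from a mixed file. The helper name `latest_snapshot`, the column name `snapshot_date` and the toy dates are assumptions for illustration; the real KH03 column names and dates may differ.

```r
# Hypothetical helper: keep only rows from the most recent snapshot date.
# `date_col` names a Date column; the real KH03 column name may differ.
latest_snapshot <- function(df, date_col) {
  dates <- df[[date_col]]
  df[dates == max(dates, na.rm = TRUE), , drop = FALSE]
}

# Toy rows standing in for a raw KH03 file that mixes snapshot dates.
# The dates are placeholders, not published reporting periods.
kh03 <- data.frame(
  org_code = c("RDY", "RDY", "RDY"),
  snapshot_date = as.Date(c("2007-03-31", "2023-12-31", "2024-03-31")),
  stringsAsFactors = FALSE
)

latest_kh03 <- latest_snapshot(kh03, "snapshot_date")
```

Under these assumptions `latest_kh03` would feed the latest-quarter file, while the full `kh03` table stays available for the stacked trend file.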

---

## Trend file schema (standard columns)

| Column | Description |
|--------|-------------|
| `source_id` | Register source identifier |
| `publication_period` | Publication slug (e.g. `march-2026`) |
| `reporting_period_start` | Period start (ISO or published format) |
| `reporting_period_end` | Period end |
| `org_code` | RDY |
| `org_name` | Trust name from source |
| `measure_id` | Measure or activity code where applicable |
| `measure_name` | Plain label (activity type, test name, etc.) |
| `metric_value` | Numeric value (`NA` if suppressed) |
| `metric_value_raw` | Published value as text, including the suppression marker `*` |
| `source_file` | Basename of raw or processed input |
| `caveats` | Source-specific caution text |

There is no `_synthetic` column: trend files contain public data only.
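
As a rough illustration of the schema, the base R sketch below lists the standard columns, derives `metric_value` from `metric_value_raw`, and checks that a data frame carries every column. The helper names `parse_metric_value` and `conform_trend` are hypothetical, and stripping thousands separators is an assumption about how some sources publish values.

```r
# Standard trend columns, copied from the schema table.
trend_cols <- c(
  "source_id", "publication_period", "reporting_period_start",
  "reporting_period_end", "org_code", "org_name", "measure_id",
  "measure_name", "metric_value", "metric_value_raw", "source_file",
  "caveats"
)

# Hypothetical helper: convert published text to numeric.
# Suppressed ("*") or other non-numeric entries become NA.
# Assumption: commas appear only as thousands separators.
parse_metric_value <- function(raw) {
  cleaned <- gsub(",", "", trimws(as.character(raw)), fixed = TRUE)
  suppressWarnings(as.numeric(cleaned))
}

# Hypothetical helper: stop if any standard column is missing,
# otherwise return the columns in the standard order.
conform_trend <- function(df) {
  missing_cols <- setdiff(trend_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing trend columns: ", paste(missing_cols, collapse = ", "))
  }
  df[, trend_cols, drop = FALSE]
}
```

Keeping the raw text alongside the parsed number means a suppressed value can still be told apart from a value that is missing for another reason.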

---

## Pipeline

1. `Rscript site/public-data/05_download_historic_public_data.R` — download, filter RDY, stack trends
2. `Rscript site/R/03_render_public_reports.R` — consume trend CSVs in reports

Each source runs in its own `tryCatch`, so one failure does not stop the run.
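
A minimal sketch of that isolation pattern in base R follows. The `sources` list and the `run_source` helper are hypothetical stand-ins, not the contents of script 05, and the failure is simulated.

```r
# Hypothetical per-source functions; the real ones live in script 05.
sources <- list(
  csds = function() "processed/trend_csds_activity_rdy.csv",
  ae   = function() stop("download failed")  # simulated failure
)

# Run one source and capture either its result or its error message.
run_source <- function(name, fn) {
  tryCatch(
    list(source = name, ok = TRUE, detail = fn()),
    error = function(e) {
      list(source = name, ok = FALSE, detail = conditionMessage(e))
    }
  )
}

# Every source is attempted, even if an earlier one fails.
results <- Map(run_source, names(sources), sources)

# Names of the sources that failed, e.g. for the run summary.
failed <- names(results)[!vapply(results, function(r) r$ok, logical(1))]
```

Returning a small status list rather than re-raising the error leaves a per-source record that a run summary could be built from.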

**Scope fallback**: target 12 comparable months; accept a **minimum of 6** if downloads fail partway.
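
The fallback rule could be expressed as a small check like the one below. The helper name `classify_coverage` and its labels are hypothetical. This note does not say what happens below six months, so the sketch only labels that case.

```r
target_months <- 12
minimum_months <- 6

# Hypothetical helper: classify a source by the number of comparable
# months that were stacked successfully.
classify_coverage <- function(n_months) {
  if (n_months >= target_months) {
    "target met"
  } else if (n_months >= minimum_months) {
    "fallback accepted"
  } else {
    "insufficient"  # handling of this case is not defined in this note
  }
}
```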

---

## Validation checklist

- [ ] Trend files exist where stacking succeeded
- [ ] Each headline measure has ≥2 distinct periods in the stacked file
- [ ] No fabricated or synthetic historic values
- [ ] Reports state that the trend is unavailable when the stacked file is missing
- [ ] No causal claims; governance footer preserved
- [ ] A&E trends framed as source validation, not ED performance
- [ ] KH03 report uses latest-quarter view, not mixed 2007 history as primary trend
- [ ] Static site links still work
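
The "≥2 distinct periods" item could be checked with a base R sketch like the following. The helper name `measures_with_too_few_periods` is hypothetical, and the toy rows use placeholder measure IDs and dates rather than published figures.

```r
# Hypothetical helper: list measures with fewer than `min_periods`
# distinct reporting periods in a stacked trend data frame.
measures_with_too_few_periods <- function(trend, min_periods = 2) {
  counts <- tapply(
    trend$reporting_period_end,
    trend$measure_id,
    function(x) length(unique(x))
  )
  names(counts)[counts < min_periods]
}

# Toy stacked data using the standard column names.
trend <- data.frame(
  measure_id = c("M1", "M1", "M2"),
  reporting_period_end = c("2026-02-28", "2026-03-31", "2026-03-31"),
  stringsAsFactors = FALSE
)

too_few <- measures_with_too_few_periods(trend)
```

Any measure listed in `too_few` cannot yet support a trend statement, so the report should state that the trend is unavailable for it.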

---

## Register

Historic run metadata: `HISTORIC_SOURCE_REGISTER.csv`  
Run summary: `HISTORIC_PUBLIC_DATA_RUN_SUMMARY.md` (generated by script 05)
