Step 1 · Labeling
Define what counts as a roof-replacement event, and what every Gold column is for — before any modeling begins.
Labeling is the first thing the roofing pipeline locks. Step 1 answers two questions that everything downstream depends on: what counts as a valid roof-replacement positive, and what each raw Gold permit column is used for. Coverage (Step 2) and model training (Steps 3–5) both read this label definition as a fixed contract, so if the label is wrong, every metric built on it measures the wrong thing.
This page is the parent overview of the data-layer labeling logic. The permit classifier itself — the two-axis permit_scope taxonomy that turns raw permit text into structured (type, action) pairs — has its own page.
1.1 · Variable inventory Done
The legacy pipeline hand-picked five text fields for the classifier and ignored the other 49 columns with no written rationale. Step 1.1 closes that gap: every one of the 54 Gold columns is enumerated and assigned exactly one role. The inventory was generated by scripts/roofing/inventory_v3.py across the 31 FIPS on disk: all 54 columns are present in 31/31 FIPS, with no AHJ-specific columns.
The LABEL_SIGNAL columns
Only the seven LABEL_SIGNAL columns feed the permit classifier (Step 1.2). Everything else is held out for features or dropped. Null rates below are measured on Pinellas (12103).
| Column | Type | Null (Pinellas) | What it contributes |
|---|---|---|---|
| PROJECT_TYPE | array | — | Primary T1 signal — structured project-type array |
| TYPE | text | 0% | Primary tier signal — never null in Pinellas |
| SUBTYPE | text | 42.9% | Primary tier signal; drives repair-vs-replacement detection |
| DESCRIPTION | text | 40.5% | Primary free-text signal when populated |
| PROJECT_NAME | text | 65.1% | High-specificity signal when populated |
| BUSINESS_NAME | text | — | Auxiliary roof signal (e.g. "ABC ROOFING & TILE"); promoted to LABEL_SIGNAL in v3 |
| JOB_VALUE | numeric | — | Dual role: discriminates REPLACEMENT vs REPAIR and a downstream feature |
High-null DROP columns
Two columns are dropped on near-total-null evidence: UNITS (100% null in Pinellas) and STORIES (97.1% null). SUBDIVISION (75.9% null) was flagged as a drop candidate. The full 54-column table — role, dtype, median/max null rate, and distinct counts per FIPS — lives on the Gold variable inventory page.
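The null-rate evidence behind the drop decisions can be illustrated with a minimal sketch. This is not the logic of scripts/roofing/inventory_v3.py, which is not shown on this page. The function names, the treatment of empty strings as missing, and the 0.95 threshold are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def null_rates(rows: List[Dict[str, Any]], columns: List[str]) -> Dict[str, float]:
    """Fraction of rows in which each column is missing.

    Assumption: both None and the empty string count as missing.
    """
    total = len(rows)
    rates: Dict[str, float] = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        rates[col] = missing / total if total else 0.0
    return rates


def drop_candidates(rates: Dict[str, float], threshold: float = 0.95) -> List[str]:
    """Columns whose null rate meets a hypothetical near-total-null threshold."""
    return sorted(col for col, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    # Toy rows, not real Gold data.
    sample = [
        {"TYPE": "ROOF", "SUBTYPE": "", "UNITS": None},
        {"TYPE": "POOL", "SUBTYPE": "RES", "UNITS": None},
    ]
    rates = null_rates(sample, ["TYPE", "SUBTYPE", "UNITS"])
    print(rates, drop_candidates(rates))
```

With a threshold of 0.95, UNITS (100%) and STORIES (97.1%) would be flagged while SUBDIVISION (75.9%) would not, which matches the outcome described above. The actual decision criterion may differ.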
1.2 · Permit taxonomy Done
The classifier reads only the LABEL_SIGNAL columns and emits a single MECE column, permit_scope — a list of (type, action) structs describing every object the permit touches and the verb applied to it. It is a two-axis alphabet: 20 types (ROOFING, BUILDING, HVAC, POOL, SHED, …) crossed with 7 actions (NEW · REPLACEMENT · REPAIR · ADDITION · ALTERATION · DEMOLITION · NA). The scope is never empty — a scopeless permit is recorded as [{OTHER, NA}] or [{UNKNOWN, NA}].
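As a rough sketch of the shape described above, the scope can be modeled as a non-empty list of (type, action) pairs checked against the two enums. The names ScopeItem, KNOWN_TYPES, and validate_scope are hypothetical, and KNOWN_TYPES holds only the type values named on this page, not the full 20-value enum.

```python
from typing import FrozenSet, List, NamedTuple

# The seven actions listed on this page.
ACTIONS: FrozenSet[str] = frozenset(
    {"NEW", "REPLACEMENT", "REPAIR", "ADDITION", "ALTERATION", "DEMOLITION", "NA"}
)

# Partial list: only the type values this page names. The full enum has 20.
KNOWN_TYPES: FrozenSet[str] = frozenset(
    {"ROOFING", "BUILDING", "HVAC", "POOL", "SHED", "OTHER", "UNKNOWN"}
)


class ScopeItem(NamedTuple):
    type: str
    action: str


def validate_scope(scope: List[ScopeItem], types: FrozenSet[str] = KNOWN_TYPES) -> None:
    """Raise ValueError if the scope is empty or uses a value outside the enums."""
    if not scope:
        raise ValueError("permit_scope must never be empty")
    for item in scope:
        if item.type not in types:
            raise ValueError(f"unknown type: {item.type!r}")
        if item.action not in ACTIONS:
            raise ValueError(f"unknown action: {item.action!r}")


if __name__ == "__main__":
    validate_scope([ScopeItem("ROOFING", "REPLACEMENT"), ScopeItem("HVAC", "NEW")])
    validate_scope([ScopeItem("OTHER", "NA")])  # a scopeless permit
```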
Repair vs replacement — which roof permits count
Within the roofing family, the positive-label question is precise: a re-roof is a replacement event, not a patch. The binding Step 4 rule classifies each roofing scope item by its action.
Positive-eligible — counts toward the target
Not a positive
Measured on Pinellas (12103): of 9,522 ROOFING items, 52.6% carry action = NA, 34.5% REPLACEMENT, 5.9% NEW, 4.6% ALTERATION, and 1.4% REPAIR. There is no AMBIGUOUS action value: NA is the bucket the earlier ADR meant by "ambiguous", that is, a roof permit whose action is unparseable, not a confirmed non-replacement.
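A tally like the one above could be produced along these lines. This is a minimal sketch that assumes each permit's scope is available as a list of (type, action) tuples. The function name is hypothetical and the input is toy data, not the Pinellas parquet.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Scope = List[Tuple[str, str]]  # (type, action) pairs for one permit


def roofing_action_shares(scopes: Iterable[Scope]) -> Dict[str, float]:
    """Share of each action among ROOFING scope items, counted per item."""
    counts = Counter(
        action
        for scope in scopes
        for item_type, action in scope
        if item_type == "ROOFING"
    )
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()} if total else {}


if __name__ == "__main__":
    toy = [
        [("ROOFING", "NA")],
        [("ROOFING", "REPLACEMENT"), ("HVAC", "NEW")],
        [("ROOFING", "NA")],
        [("POOL", "NEW")],
    ]
    print(roofing_action_shares(toy))
```

The five shares quoted above sum to 99.0%, so the remaining 1.0% presumably falls under the actions not listed.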
1.3 · Owner-occupied-at-permit rule Permanent
A roof permit only counts as a valid positive if the owner was occupying the property at the time of the permit. This is a permanent client directive (CallZeke, 2026-05-26), in force for every training run and every shipped list from v18 onward.
oo_at_permit =
(silver.OwnerOccupied == 'Y' at the snapshot ≤30d pre-permit)
OR
(norm(MailingFullStreetAddress) == norm(SitusFullStreetAddress)
at the snapshot ≤30d pre-permit)
The occupancy test is measured against the silver snapshot dated ≤30 days before the permit issue date, using the same address normalizer the FA↔BuildZoom matcher uses. The rule is applied in three places:
| Where | Effect |
|---|---|
| (a) Training labels | A permit with oo_at_permit = False is demoted to y = 0. The row is not deleted from the labels parquet — permit_scope and gold_vintage stay — but every consumer ANDs is_roofing with oo_at_permit when computing positives. |
| (b) Training rows | The per-property training universe is restricted to oo_at_t0 = True, computed identically against the silver snapshot ≤30d pre-T0. Absentees at T0 never enter training — this drops roughly 11% of training rows. |
| (c) Call-list | The shipped list requires OwnerOccupied = 'Y' and norm(Mailing) == norm(Situs) at the current silver snapshot. Zero absentees ship, regardless of model score. |
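The occupancy test can be sketched as follows. Several details are assumptions, not confirmed by this page: the norm function here is a simple stand-in for the real FA↔BuildZoom normalizer, the latest snapshot inside the 30-day window is the one used, a snapshot dated on the permit date itself counts as in-window, and a permit with no in-window snapshot evaluates to False. The SilverSnapshot class and its field names are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class SilverSnapshot:
    snapshot_date: date
    owner_occupied: Optional[str]   # stands in for silver.OwnerOccupied
    mailing_address: Optional[str]  # stands in for MailingFullStreetAddress
    situs_address: Optional[str]    # stands in for SitusFullStreetAddress


def norm(address: Optional[str]) -> str:
    """Stand-in normalizer: uppercase, drop punctuation, collapse whitespace."""
    if not address:
        return ""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return " ".join(cleaned.split())


def pick_snapshot(
    snapshots: List[SilverSnapshot], permit_date: date, max_days: int = 30
) -> Optional[SilverSnapshot]:
    """Latest snapshot dated at most max_days before the permit (assumed inclusive)."""
    window_start = permit_date - timedelta(days=max_days)
    eligible = [s for s in snapshots if window_start <= s.snapshot_date <= permit_date]
    return max(eligible, key=lambda s: s.snapshot_date, default=None)


def oo_at_permit(snapshots: List[SilverSnapshot], permit_date: date) -> bool:
    snap = pick_snapshot(snapshots, permit_date)
    if snap is None:
        return False  # assumption: no in-window snapshot means not owner-occupied
    flagged = snap.owner_occupied == "Y"
    mailing = norm(snap.mailing_address)
    # Two blank addresses must not count as a match.
    same_address = bool(mailing) and mailing == norm(snap.situs_address)
    return flagged or same_address
```

The training-row filter in (b) would apply the same test against T0 instead of the permit date. The call-list filter in (c) is stricter, requiring both conditions at the current snapshot.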
1.4 · Canonical event-date rule Frozen
Every permit gets exactly one trustworthy date so Step 4 can decide whether it falls inside a row's forward label window. The rule takes the earliest signal that the permit existed: the minimum of the non-null values among its six status dates, capped against future-date corruption.
event_date = MIN(non-null status dates), then capped
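A minimal sketch of this rule, under stated assumptions: the cap is a single reference date supplied by the caller (for example a date derived from the Gold vintage, which this page does not confirm), and a permit with no status dates at all yields None. The function and parameter names are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional


def event_date(status_dates: Iterable[Optional[date]], cap: date) -> Optional[date]:
    """Earliest non-null status date, never later than the cap.

    Assumption: `cap` is a caller-supplied reference date that guards
    against corrupt future dates.
    """
    present = [d for d in status_dates if d is not None]
    if not present:
        return None  # assumption: no usable date on the permit
    return min(min(present), cap)


if __name__ == "__main__":
    six_dates = [None, date(2024, 5, 3), date(2024, 4, 1), None, None, None]
    print(event_date(six_dates, cap=date(2026, 5, 12)))
```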
permit_scope for audit.
1.5 · Output contract — labels.parquet Locked
Step 1.2 writes one labels parquet per FIPS at data/sandbox/classify_v5/<FIPS>/labels.parquet — one row per Gold permit (PK = building_permit_id), row count equal to the raw Gold row count. The schema is locked and matches the v5.3.3 classifier output as produced today.
| Column | Type | Notes |
|---|---|---|
| building_permit_id | varchar | Gold PK back-pointer; stable join key for every downstream consumer |
| fips | varchar(5) | 5-digit zero-padded (e.g. 12103) |
| permit_scope | list<struct<type, action>> | The MECE taxonomy; type ∈ 20-enum, action ∈ 7-enum. Never empty. |
| spec_v | varchar | Classifier spec version, e.g. 'v5.3.3' |
| gold_vintage | varchar | Source Gold parquet vintage, e.g. 'Roofing_Master_Gold_V1_20260512' |
| oo_at_permit | bool | Pending v5.4. Owner-occupied-at-permit flag (§1.3). Written by a post-processor; bumps spec_v v5.3.3 → v5.4.0. |
No derived columns (is_roofing, roof_action, a primary permit_type) are stored — they are non-MECE projections that every consumer derives from permit_scope itself. The consumer contract is binding: pin both spec_v and gold_vintage, derive is_roofing as 'ROOFING' ∈ permit_scope[].type, and never re-classify at the consumer.
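On the consumer side, the contract might look like the sketch below. It assumes each labels row has been loaded as a plain dict whose permit_scope is a list of {"type", "action"} dicts. The function names and the pinned constants (taken from the examples in the table above) are illustrative only.

```python
from typing import Any, Dict, List

# Pins a consumer would fix up front; values copied from the schema examples.
PINNED_SPEC_V = "v5.3.3"
PINNED_GOLD_VINTAGE = "Roofing_Master_Gold_V1_20260512"


def check_pins(
    row: Dict[str, Any],
    spec_v: str = PINNED_SPEC_V,
    gold_vintage: str = PINNED_GOLD_VINTAGE,
) -> None:
    """Refuse rows produced by a different classifier spec or Gold vintage."""
    if row["spec_v"] != spec_v:
        raise ValueError(f"spec_v mismatch: {row['spec_v']!r} != {spec_v!r}")
    if row["gold_vintage"] != gold_vintage:
        raise ValueError(
            f"gold_vintage mismatch: {row['gold_vintage']!r} != {gold_vintage!r}"
        )


def is_roofing(permit_scope: List[Dict[str, str]]) -> bool:
    """Derived at read time: 'ROOFING' appears among the scope item types."""
    return any(item["type"] == "ROOFING" for item in permit_scope)


if __name__ == "__main__":
    row = {
        "building_permit_id": "example-id",  # placeholder, not a real permit
        "fips": "12103",
        "permit_scope": [
            {"type": "ROOFING", "action": "REPLACEMENT"},
            {"type": "HVAC", "action": "NEW"},
        ],
        "spec_v": "v5.3.3",
        "gold_vintage": "Roofing_Master_Gold_V1_20260512",
    }
    check_pins(row)
    print(is_roofing(row["permit_scope"]))
```

Deriving is_roofing at read time keeps permit_scope as the single source of truth, so a consumer never carries its own copy of the classification.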
Where this connects
| Direction | What |
|---|---|
| Detail | The two-axis permit_scope classifier → permit classification |
| Reference | All 54 Gold columns + roles → Gold variable inventory |
| Next step | Coverage — which jurisdictions we can trust → Step 2 · Coverage |
| Spine | The full 9-step pipeline → roofing pipeline |
Rendered from notes/Roofing/steps/{01_variable_inventory, 04_repair_vs_replace, 05_labels_output_contract, 06_event_date_rule, 06b_oo_at_permit_label_filter}.md · classifier v5.3.3 (oo_at_permit pending v5.4.0). Null rates and action shares measured on Pinellas 12103; ROOFING:NA → positive-eligible flagged as an assumption pending hand-grade.