
Step 1 · Labeling

Define what counts as a roof-replacement event, and what every Gold column is for — before any modeling begins.

Labeling is the first thing the roofing pipeline locks. Step 1 answers two questions that everything downstream depends on: what counts as a valid roof-replacement positive, and what each raw Gold permit column is used for. Coverage (Step 2) and model training (Steps 3–5) both read this label definition as a fixed contract — so if the label is wrong, every metric above it is measuring the wrong thing.

This page is the parent overview of the data-layer labeling logic. The permit classifier itself — the two-axis permit_scope taxonomy that turns raw permit text into structured (type, action) pairs — has its own page.

Where Step 1 sits. Step 1 (Labeling) → Step 2 (Coverage) → enrichment, folds, features, training. The full nine-step spine is on the roofing pipeline page.

1.1 · Variable inventory (Done)

The legacy pipeline hand-picked five text fields for the classifier and ignored the other ~48 columns with no written rationale. Step 1.1 closes that gap: every one of the 54 Gold columns is enumerated and assigned exactly one role. The inventory was generated by scripts/roofing/inventory_v3.py across the 31 FIPS on disk — all 54 columns are present in 31/31 FIPS, with no AHJ-specific columns.

- LABEL_SIGNAL (7 columns): text/array fields that discriminate permit categories
- FEATURE (23 columns): downstream predictors (dates, value, geo, owner)
- METADATA (22 columns): provenance / system-of-record fields
- DROP (2 columns): near-total-null or non-recoverable noise
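
The "exactly one role per column" rule and the 7 / 23 / 22 / 2 split can be checked mechanically. The following is a minimal sketch, not the logic of scripts/roofing/inventory_v3.py: the function names and the toy assignment list are hypothetical, and only the role names and counts come from this page.

```python
from collections import Counter

ROLES = ("LABEL_SIGNAL", "FEATURE", "METADATA", "DROP")
# Role counts as stated on this page; they should sum to the 54 Gold columns.
EXPECTED_ROLE_COUNTS = {"LABEL_SIGNAL": 7, "FEATURE": 23, "METADATA": 22, "DROP": 2}


def validate_inventory(assignments, gold_columns):
    """Return a list of problems; an empty list means every column has exactly one valid role.

    assignments: list of (column, role) pairs (hypothetical input shape).
    gold_columns: list of Gold column names.
    """
    errors = []
    per_column = Counter(column for column, _ in assignments)
    for column in gold_columns:
        n = per_column.get(column, 0)
        if n != 1:
            errors.append(f"{column}: expected exactly 1 role, found {n}")
    for column, role in assignments:
        if role not in ROLES:
            errors.append(f"{column}: unknown role {role!r}")
        if column not in gold_columns:
            errors.append(f"{column}: not a Gold column")
    return errors


def role_counts(assignments):
    """Count how many columns were assigned to each role."""
    return dict(Counter(role for _, role in assignments))


# Toy subset for illustration only (four of the 54 columns).
toy = [("TYPE", "LABEL_SIGNAL"), ("SUBTYPE", "LABEL_SIGNAL"),
       ("UNITS", "DROP"), ("STORIES", "DROP")]
toy_columns = ["TYPE", "SUBTYPE", "UNITS", "STORIES"]
print(validate_inventory(toy, toy_columns))
print(role_counts(toy))
```

On the full inventory, comparing role_counts(...) against EXPECTED_ROLE_COUNTS would flag any drift from the published split.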

The LABEL_SIGNAL columns

Only the seven LABEL_SIGNAL columns feed the permit classifier (Step 1.2). Everything else is held out as a feature, kept as metadata, or dropped. Null rates below are measured on Pinellas (12103).

| Column | Type | Null (Pinellas) | What it contributes |
| --- | --- | --- | --- |
| PROJECT_TYPE | array | | Primary T1 signal: structured project-type array |
| TYPE | text | 0% | Primary tier signal; never null in Pinellas |
| SUBTYPE | text | 42.9% | Primary tier signal; drives repair-vs-replacement detection |
| DESCRIPTION | text | 40.5% | Primary free-text signal when populated |
| PROJECT_NAME | text | 65.1% | High-specificity signal when populated |
| BUSINESS_NAME | text | | Auxiliary roof signal (e.g. "ABC ROOFING & TILE"); promoted to LABEL_SIGNAL in v3 |
| JOB_VALUE | numeric | | Dual role: discriminates REPLACEMENT vs REPAIR and serves as a downstream feature |

High-null DROP columns

Two columns are dropped on near-total-null evidence: UNITS (100% null in Pinellas) and STORIES (97.1% null). SUBDIVISION (75.9% null) was flagged as a drop candidate. The full 54-column table — role, dtype, median/max null rate, and distinct counts per FIPS — lives on the Gold variable inventory page.

1.2 · Permit taxonomy (Done)

The classifier reads only the LABEL_SIGNAL columns and emits a single MECE column, permit_scope — a list of (type, action) structs describing every object the permit touches and the verb applied to it. It is a two-axis alphabet: 20 types (ROOFING, BUILDING, HVAC, POOL, SHED, …) crossed with 7 actions (NEW · REPLACEMENT · REPAIR · ADDITION · ALTERATION · DEMOLITION · NA). The scope is never empty — a scopeless permit is recorded as [{OTHER, NA}] or [{UNKNOWN, NA}].
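
As a rough illustration of the shape of permit_scope, the sketch below models one scope item and the never-empty rule. It is an assumption-laden sketch rather than the classifier's code: ScopeItem and ensure_non_empty are hypothetical names, the 20-value type enum is not reproduced here (it lives on the permit classification page), and the choice between OTHER and UNKNOWN for a scopeless permit is left to the caller because this page does not specify it.

```python
from dataclasses import dataclass

# The seven actions listed above.
ACTIONS = ("NEW", "REPLACEMENT", "REPAIR", "ADDITION", "ALTERATION", "DEMOLITION", "NA")


@dataclass(frozen=True)
class ScopeItem:
    """One (type, action) struct. Only the action axis is validated in this sketch."""
    type: str
    action: str

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")


def ensure_non_empty(items, fallback_type="UNKNOWN"):
    """Enforce the never-empty rule: a scopeless permit gets a single NA item.

    fallback_type should be "OTHER" or "UNKNOWN"; which one applies is decided
    by the classifier spec, not by this sketch.
    """
    if fallback_type not in ("OTHER", "UNKNOWN"):
        raise ValueError("fallback_type must be 'OTHER' or 'UNKNOWN'")
    return list(items) if items else [ScopeItem(fallback_type, "NA")]


scope = ensure_non_empty([ScopeItem("ROOFING", "REPLACEMENT"), ScopeItem("HVAC", "NEW")])
empty = ensure_non_empty([])
print(scope)
print(empty)
```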

Full spec lives elsewhere. The complete 20×7 taxonomy, cascade order, keyword tiers, and version history are on the permit classification page. This overview deliberately does not duplicate it.

Repair vs replacement — which roof permits count

Within the roofing family, the positive-label question is precise: a re-roof is a replacement event, not a patch. The binding Step 4 rule classifies each roofing scope item by its action.

Positive-eligible — counts toward the target

1. ROOFING : REPLACEMENT. Confirmed re-roof. Explicit verb in the permit text (reroof / re-roof / replace / tear-off).
2. ROOFING : NA, the "ambiguous re-roof" bucket. A roof permit with no parseable verb. Treated as positive-eligible: a roof permit defaults to a replacement absent a contrary verb. Step 4 must not silently drop these. They are ~53% of roofing items, so dropping them would gut more than half the roof positives.

Not a positive

1. ROOFING : REPAIR. A patch (repair / leak), not a replacement. Carried only as a feature (prior-repair history), never as a positive event.
2. ROOFING : NEW (with a {BUILDING, NEW} scope), ADDITION, ALTERATION, DEMOLITION. New-construction roofs and non-replacement actions are excluded from positives.
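
The split above reduces to a small predicate over permit_scope. The sketch below is illustrative only: the function names are hypothetical, scope items are represented as plain (type, action) tuples, and it assumes a permit is positive-eligible when any of its ROOFING items carries a positive-eligible action, which this page does not state explicitly for mixed-action permits.

```python
# Actions that make a ROOFING item positive-eligible, per the rule above.
POSITIVE_ELIGIBLE_ACTIONS = frozenset({"REPLACEMENT", "NA"})


def is_positive_eligible(permit_scope):
    """True if any ROOFING item is REPLACEMENT or NA (assumed any-item semantics)."""
    return any(
        item_type == "ROOFING" and action in POSITIVE_ELIGIBLE_ACTIONS
        for item_type, action in permit_scope
    )


def has_roof_repair(permit_scope):
    """True if the permit carries a ROOFING:REPAIR item (a feature, never a positive)."""
    return any(
        item_type == "ROOFING" and action == "REPAIR"
        for item_type, action in permit_scope
    )


examples = {
    "confirmed re-roof": [("ROOFING", "REPLACEMENT")],
    "ambiguous re-roof": [("ROOFING", "NA")],
    "patch": [("ROOFING", "REPAIR")],
    "new construction": [("BUILDING", "NEW"), ("ROOFING", "NEW")],
    "not a roof permit": [("HVAC", "REPLACEMENT")],
}
for name, scope in examples.items():
    print(name, is_positive_eligible(scope), has_roof_repair(scope))
```

Note that eligibility here is only the taxonomy half of the label; the owner-occupied rule in §1.3 still has to hold before a permit counts as a positive.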

Measured on Pinellas (12103): of 9,522 ROOFING items, 52.6% carry action = NA, REPLACEMENT 34.5%, NEW 5.9%, ALTERATION 4.6%, REPAIR 1.4%. There is no AMBIGUOUS action value — NA is the bucket the earlier ADR meant by "ambiguous": a roof permit whose action is unparseable, not a confirmed non-replacement.

Open — owed a human hand-grade. Treating ROOFING:NA → positive-eligible is the documented default, flagged as an assumption: confirm by hand-grading a random NA sample that the bucket is genuinely re-roof-dominated. The roof_action derivation also scans the whole DESCRIPTION blob (not only the roof-context clause), so per-action precision beyond "is it a re-roof" is approximate; the positive/negative split is unaffected because both NA and REPLACEMENT land positive-eligible.

1.3 · Owner-occupied-at-permit rule (Permanent)

A roof permit only counts as a valid positive if the owner was occupying the property at the time of the permit. This is a permanent client directive (CallZeke, 2026-05-26), in force for every training run and every shipped list from v18 onward.

oo_at_permit =
   (silver.OwnerOccupied == 'Y'  at the snapshot ≤30d pre-permit)
   OR
   (norm(MailingFullStreetAddress) == norm(SitusFullStreetAddress)
                                 at the snapshot ≤30d pre-permit)

The occupancy test is measured against the silver snapshot dated ≤30 days before the permit issue date, using the same address normalizer the FA↔BuildZoom matcher uses. The rule is applied in three places:

| Where | Effect |
| --- | --- |
| (a) Training labels | A permit with oo_at_permit = False is demoted to y = 0. The row is not deleted from the labels parquet (permit_scope and gold_vintage stay), but every consumer ANDs is_roofing with oo_at_permit when computing positives. |
| (b) Training rows | The per-property training universe is restricted to oo_at_t0 = True, computed identically against the silver snapshot ≤30d pre-T0. Absentees at T0 never enter training; this drops roughly 11% of training rows. |
| (c) Call-list | The shipped list requires OwnerOccupied = 'Y' and norm(Mailing) == norm(Situs) at the current silver snapshot. Zero absentees ship, regardless of model score. |

This rule is frozen: do not relax it. The client's instruction was explicit: trust the business definition over the model. "Owner-occupied people are the ones who actually change their roofs - that's who I want, period." Mixing absentee re-roofs (rentals, flips, insurance-driven turnover, SFR portfolios) into the labels teaches the model to predict the wrong decision and flatters the lift number with leads we would never mail to. When no silver snapshot satisfies the ≤30d window (e.g. permits before 2022-05), oo_at_permit is NULL and the row is conservatively treated as y = 0.
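
The occupancy test and its NULL handling can be sketched as follows. This is a sketch under stated assumptions, not the production post-processor: norm here is a trivial stand-in for the shared FA↔BuildZoom address normalizer, snapshots are plain dicts with a hypothetical snapshot_date key, the ≤30d window is treated as inclusive at both ends, and when several snapshots qualify the most recent one is used.

```python
from datetime import date, timedelta


def norm(address):
    """Hypothetical stand-in for the shared address normalizer."""
    return " ".join(address.upper().split()) if address else None


def pick_snapshot(snapshots, permit_date, window_days=30):
    """Latest silver snapshot dated within window_days before permit_date, else None."""
    earliest_allowed = permit_date - timedelta(days=window_days)
    eligible = [s for s in snapshots
                if earliest_allowed <= s["snapshot_date"] <= permit_date]
    return max(eligible, key=lambda s: s["snapshot_date"]) if eligible else None


def oo_at_permit(snapshots, permit_date):
    """True / False per the rule above; None when no snapshot falls in the window."""
    snap = pick_snapshot(snapshots, permit_date)
    if snap is None:
        return None
    if snap.get("OwnerOccupied") == "Y":
        return True
    mailing = norm(snap.get("MailingFullStreetAddress"))
    situs = norm(snap.get("SitusFullStreetAddress"))
    # Two missing addresses are not treated as a match (assumption).
    return mailing is not None and mailing == situs


def label_y(roof_positive_eligible, oo):
    """y = 1 only when the permit is positive-eligible AND oo_at_permit is True.

    A NULL (None) occupancy flag is conservatively treated as y = 0.
    """
    return int(bool(roof_positive_eligible) and oo is True)


snaps = [{
    "snapshot_date": date(2023, 3, 1),
    "OwnerOccupied": "N",
    "MailingFullStreetAddress": "12 Main St ",
    "SitusFullStreetAddress": "12 MAIN ST",
}]
print(oo_at_permit(snaps, date(2023, 3, 20)))  # snapshot is 19 days before the permit
print(oo_at_permit(snaps, date(2023, 5, 1)))   # no snapshot inside the window
```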

1.4 · Canonical event-date rule (Frozen)

Every permit gets exactly one trustworthy date so Step 4 can decide whether it falls inside a row's forward label window. The rule takes the earliest signal that the permit existed: the minimum over whichever of its six status dates are non-null, capped against future-date corruption.

event_date = MIN(non-null status dates), then capped

1. Take MIN of the six dates: APPLIED · ISSUED · INITIAL_STATUS · COMPLETED · CANCELLED · LATEST_STATUS. The earliest date is closest to "when the owner decided to act." MIN keeps every row where any date exists (~98%) and biases conservatively toward the earliest signal of intent. MAX was rejected because it captures the most recent activity, which is downstream of the decision.
2. Cap at today + 90 days. Gold contains future-dated corruption (e.g. 2066-01-01, from year-rollover bugs in the source AHJ system). The cap keeps an honest-but-bounded date and records the provenance tag "event_date:capped_future" so it can be audited.
3. Read it forward from T0, never as a feature anchor. Anchoring is T0-relative, not event-relative. y = 1 iff a qualifying roof-replacement permit has event_date ∈ (T0, T0+H], H = 6 months. Features are as-of ≤ T0. The event never defines a feature anchor.

Not a stored label column. event_date is computed by the Step 3 / Step 4 permit reader on demand; it is deliberately absent from the Step 1.5 labels contract below. If all six dates are null the permit has no usable date and is excluded from positive/negative construction (provenance: "event_date:all_null"), but it keeps its permit_scope for audit.
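
The three steps above can be sketched in a few lines. This is a minimal illustration, not the Step 3 / Step 4 permit reader: the function names are hypothetical, a permit is a plain dict keyed by the six status-date names (the real Gold column names may differ), and "6 months" is interpreted as calendar months with the day clamped to the end of shorter months.

```python
import calendar
from datetime import date, timedelta

STATUS_DATES = ("APPLIED", "ISSUED", "INITIAL_STATUS",
                "COMPLETED", "CANCELLED", "LATEST_STATUS")


def canonical_event_date(permit, today, cap_days=90):
    """Return (event_date, provenance_tag) using MIN-then-cap.

    provenance_tag is None for an ordinary date.
    """
    dates = [permit[k] for k in STATUS_DATES if permit.get(k) is not None]
    if not dates:
        return None, "event_date:all_null"
    earliest = min(dates)
    cap = today + timedelta(days=cap_days)
    if earliest > cap:
        # Even the earliest date is implausibly far in the future: clamp it.
        return cap, "event_date:capped_future"
    return earliest, None


def add_months(d, months):
    """Shift a date by whole calendar months, clamping the day if needed."""
    year, month_index = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(d.day, last_day))


def in_label_window(event_date, t0, horizon_months=6):
    """True iff event_date lies in the half-open window (T0, T0 + H]."""
    if event_date is None:
        return False
    return t0 < event_date <= add_months(t0, horizon_months)


today = date(2024, 6, 1)
permit = {"APPLIED": date(2024, 3, 1), "ISSUED": date(2024, 3, 15), "COMPLETED": None}
print(canonical_event_date(permit, today))
print(canonical_event_date({"ISSUED": date(2066, 1, 1)}, today))
print(in_label_window(date(2024, 3, 1), t0=date(2024, 1, 1)))
```

Because clamping is monotone, capping each date before taking MIN would give the same result as capping the MIN, so the order of the two operations does not matter in this sketch.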

1.5 · Output contract: labels.parquet (Locked)

Step 1.2 writes one labels parquet per FIPS at data/sandbox/classify_v5/<FIPS>/labels.parquet — one row per Gold permit (PK = building_permit_id), row count equal to the raw Gold row count. The schema is locked and matches the v5.3.3 classifier output as produced today.

| Column | Type | Notes |
| --- | --- | --- |
| building_permit_id | varchar | Gold PK back-pointer; stable join key for every downstream consumer |
| fips | varchar(5) | 5-digit zero-padded (e.g. 12103) |
| permit_scope | list<struct<type, action>> | The MECE taxonomy; type ∈ 20-enum, action ∈ 7-enum. Never empty. |
| spec_v | varchar | Classifier spec version, e.g. 'v5.3.3' |
| gold_vintage | varchar | Source Gold parquet vintage, e.g. 'Roofing_Master_Gold_V1_20260512' |
| oo_at_permit | bool (pending v5.4) | Owner-occupied-at-permit flag (§1.3). Written by a post-processor; bumps spec_v v5.3.3 → v5.4.0. |

No derived columns (is_roofing, roof_action, a primary permit_type) are stored — they are non-MECE projections that every consumer derives from permit_scope itself. The consumer contract is binding: pin both spec_v and gold_vintage, derive is_roofing as 'ROOFING' ∈ permit_scope[].type, and never re-classify at the consumer.
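
A consumer honoring this contract might look roughly like the sketch below. It is illustrative only: rows are plain dicts standing in for records read from labels.parquet, the function names are hypothetical, and the pinned values are the example versions quoted in the table above rather than a recommendation.

```python
# Pins taken from the examples on this page; a real consumer would set its own.
PINNED_SPEC_V = "v5.3.3"
PINNED_GOLD_VINTAGE = "Roofing_Master_Gold_V1_20260512"


def is_roofing(permit_scope):
    """Derive is_roofing as 'ROOFING' in permit_scope[].type; nothing is re-classified."""
    return any(item["type"] == "ROOFING" for item in permit_scope)


def read_labels(rows, spec_v=PINNED_SPEC_V, gold_vintage=PINNED_GOLD_VINTAGE):
    """Check the pins and the never-empty rule, then attach the derived flag."""
    out = []
    for row in rows:
        if row["spec_v"] != spec_v or row["gold_vintage"] != gold_vintage:
            raise ValueError(
                f"{row['building_permit_id']}: version mismatch "
                f"({row['spec_v']}, {row['gold_vintage']})"
            )
        if not row["permit_scope"]:
            raise ValueError(f"{row['building_permit_id']}: empty permit_scope")
        out.append({**row, "is_roofing": is_roofing(row["permit_scope"])})
    return out


rows = [
    {"building_permit_id": "A1", "fips": "12103", "spec_v": "v5.3.3",
     "gold_vintage": "Roofing_Master_Gold_V1_20260512",
     "permit_scope": [{"type": "ROOFING", "action": "NA"}]},
    {"building_permit_id": "B2", "fips": "12103", "spec_v": "v5.3.3",
     "gold_vintage": "Roofing_Master_Gold_V1_20260512",
     "permit_scope": [{"type": "HVAC", "action": "REPLACEMENT"}]},
]
labeled = read_labels(rows)
print([(r["building_permit_id"], r["is_roofing"]) for r in labeled])
```

Failing loudly on a version mismatch, rather than filtering silently, keeps a consumer from mixing labels produced under different classifier specs.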

Where this connects

| Direction | What |
| --- | --- |
| Detail | The two-axis permit_scope classifier → permit classification |
| Reference | All 54 Gold columns + roles → Gold variable inventory |
| Next step | Coverage: which jurisdictions we can trust → Step 2 · Coverage |
| Spine | The full 9-step pipeline → roofing pipeline |

Rendered from notes/Roofing/steps/{01_variable_inventory, 04_repair_vs_replace, 05_labels_output_contract, 06_event_date_rule, 06b_oo_at_permit_label_filter}.md · classifier v5.3.3 (oo_at_permit pending v5.4.0). Null rates and action shares measured on Pinellas 12103; ROOFING:NA → positive-eligible flagged as an assumption pending hand-grade.