
Step 1.2 · done   ↑ Step 1 — Labeling (parent)  ·  → Step 2 — Coverage (next)

Permit classification & taxonomy

The two-axis permit_scope classifier (v5.3.3): how every building permit is labelled with WHAT it touches and WHAT is being done to it.


Every building permit we ingest is passed through one classifier that emits a single MECE (mutually exclusive, collectively exhaustive) classification column: permit_scope. The v4 classifier emitted one value from a flat 25-entry enum that quietly mixed two unrelated questions: WHAT the permit touches (roof, pool, HVAC) and WHAT action is taken (new, replace, demolish). Five of the 25 "types" were actions wearing a type costume: a permit to "demolish a detached garage" forced the cascade to pick either DEMOLITION or GARAGE and throw the other half away. v5 splits the question into two orthogonal axes, an object axis (permit_type) and an action axis (permit_action), so neither half is lost. Each (type, action) pair is emitted as one struct, and the list of those structs is permit_scope.

The two axes

Each permit_scope item is a {type, action} struct: what the permit touches drawn from a 20-value object axis, paired with what is done to it drawn from a 7-value action axis. Both axes are MECE and both are evaluated by a specificity-descending cascade — the most specific category that matches wins, the catch-alls come last.
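The item shape can be sketched as follows. This is a minimal illustration, assuming Python enums and a frozen dataclass; the names `PermitType`, `PermitAction` and `ScopeItem` are hypothetical and are not taken from classify_v5.py. Only the enum members come from this page.

```python
from dataclasses import dataclass
from enum import Enum

# Object axis: the 20 permit_type values, in the order this page lists them.
PermitType = Enum("PermitType", [
    "SOLAR", "POOL", "FIRE", "GENERATOR", "HVAC", "ELECTRICAL", "PLUMBING",
    "ROOFING", "WINDOWS", "DOORS", "GARAGE", "SHED", "DECK", "FENCE", "SIGN",
    "FOUNDATION", "SITEWORK", "BUILDING", "OTHER", "UNKNOWN",
])

# Action axis: the 7 permit_action values.
PermitAction = Enum("PermitAction", [
    "DEMOLITION", "NEW", "REPLACEMENT", "REPAIR", "ADDITION", "ALTERATION", "NA",
])


@dataclass(frozen=True)
class ScopeItem:
    """One (object, verb) pair; type and action travel together."""
    type: PermitType
    action: PermitAction


# "demolish a detached garage" keeps both halves instead of dropping one.
garage_demo = [ScopeItem(PermitType.GARAGE, PermitAction.DEMOLITION)]
```

Because the pair lives in one immutable object, a consumer cannot read a type from one item and an action from another by accident.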

Axis 1 · permit_type — the object

20 values · MECE · cascade evaluated most-specific → most-general

| Value | What it is | Note |
| --- | --- | --- |
| SOLAR | Solar / photovoltaic system | Beats ROOFING: a solar-on-roof permit is solar work, not roof-cover work |
| POOL | Swimming pool / spa | Beats ROOFING: a pool-cage roof is pool work |
| FIRE | Fire alarm / sprinkler / suppression | Beats ELECTRICAL & PLUMBING |
| GENERATOR | Standby / backup generator | Beats ELECTRICAL |
| HVAC | Heating, ventilation, A/C | System type; may match broadly in free text |
| ELECTRICAL | Electrical system / wiring | System type |
| PLUMBING | Plumbing / water heater / drains | System type; "water heater in garage" is PLUMBING, not GARAGE |
| ROOFING | Roof-covering system | Anti-FP guarded, compound-first keywords (see below) |
| WINDOWS | Windows | |
| DOORS | Doors | |
| GARAGE | Garage as the structure | Location-prone: Tier-1 or strong compound only |
| SHED | Shed / accessory structure | Location-prone; carport stays anti-FP, never SHED |
| DECK | Deck | Location-prone: Tier-1 or strong compound only |
| FENCE | Fence | |
| SIGN | Signage | Above FOUNDATION (sign footings) |
| FOUNDATION | Foundation / underpinning | Below SIGN; bare "footing" dropped |
| SITEWORK | Grading / site / civil work | |
| BUILDING | Whole-structure permit | Broadest object; absorbs v4 NEW_CONSTRUCTION + STRUCTURAL |
| OTHER | Categorizable but none of the above | Catch-all |
| UNKNOWN | No legible signal | Last resort |
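The most-specific-first ordering can be illustrated with a tiny first-match cascade. This is a sketch only: the function `primary_type`, the abbreviated `TYPE_CASCADE` list and every keyword in it are invented for illustration, and treating empty text as UNKNOWN is an assumption about what "no legible signal" means.

```python
# Abbreviated cascade in table order: specific objects first, catch-alls last.
# Keywords are hypothetical; the shipped lists are not shown on this page.
TYPE_CASCADE = [
    ("SOLAR", ("solar", "photovoltaic")),
    ("POOL", ("swimming pool", "pool cage")),
    ("PLUMBING", ("water heater", "plumbing")),
    ("ROOFING", ("reroof", "re-roof", "shingle")),
    ("GARAGE", ("detached garage",)),
]


def primary_type(text: str) -> str:
    """Return the first cascade entry whose keyword appears in the text."""
    lowered = text.lower()
    if not lowered.strip():
        return "UNKNOWN"  # assumption: empty text carries no legible signal
    for permit_type, keywords in TYPE_CASCADE:
        if any(keyword in lowered for keyword in keywords):
            return permit_type
    return "OTHER"  # text exists but matches no listed object
```

With this ordering, `primary_type("replace water heater in garage")` returns PLUMBING because PLUMBING is evaluated before GARAGE, and a solar-on-roof description resolves to SOLAR before ROOFING is ever tested.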

Axis 2 · permit_action — the verb

7 values · cascade first-match-wins (specific → general)

| Value | What it is | Note |
| --- | --- | --- |
| DEMOLITION | Remove / tear down whole structure | Guarded: suppressed when remodel / addition / tenant-improvement words co-occur (interior demo ≠ teardown) |
| NEW | First-time install / new construction | |
| REPLACEMENT | Like-for-like swap of an existing system | The confirmed-reroof verb for ROOFING |
| REPAIR | Fix part of an existing system (leak, patch, partial) | For ROOFING, NOT a positive |
| ADDITION | Extend / add to an existing structure | |
| ALTERATION | Modify / remodel / renovate | Broad catch-all action; absorbs v4 INTERIOR / RENOVATION |
| NA | No action signal legible | The norm, not an edge case (see positive-eligibility below) |

Why action replaced subtype. permit_action replaces the v4 permit_subtype — now every permit gets a sub-classification, not just roofing/solar/HVAC. INTERIOR, RENOVATION, NEW_CONSTRUCTION and DEMOLITION were all v4 "types" that are really actions; they moved here.

Cascade order (shipped). The v5 spec lists actions most-specific first; the shipped classify_v5.py evaluates them NEW (whole-structure) → DEMOLITION → REPLACEMENT → REPAIR → ADDITION → NEW (component) → ALTERATION → NA, first match wins. Precedence matters in mixed text: "new construction reroof" → REPLACEMENT; "teardown interior" → ALTERATION (DEMOLITION guard).
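A first-match-wins action cascade with the DEMOLITION guard might look like the sketch below. The evaluation order follows the shipped order described above; every regular expression is a hypothetical stand-in, and including "interior" among the guard words is an assumption made so that the "teardown interior" example resolves as stated.

```python
import re

# Hypothetical guard words: remodel / addition / tenant-improvement, plus
# "interior" (assumed) so that interior demo is not read as a teardown.
DEMOLITION_GUARD = re.compile(r"remodel|addition|tenant improvement|interior")

# Shipped order; NEW appears twice (whole-structure first, component later).
ACTION_CASCADE = [
    ("NEW", re.compile(r"new (single family|sfr|dwelling|building)")),
    ("DEMOLITION", re.compile(r"demo(lish|lition)?\b|tear ?down")),
    ("REPLACEMENT", re.compile(r"re-?roof|replace|change ?out")),
    ("REPAIR", re.compile(r"repair|patch|leak")),
    ("ADDITION", re.compile(r"addition|add to")),
    ("NEW", re.compile(r"\bnew\b|install")),
    ("ALTERATION", re.compile(r"alter|remodel|renovat|interior")),
]


def classify_action(text: str) -> str:
    """Return the first matching action, or NA when nothing matches."""
    lowered = text.lower()
    for action, pattern in ACTION_CASCADE:
        if not pattern.search(lowered):
            continue
        if action == "DEMOLITION" and DEMOLITION_GUARD.search(lowered):
            continue  # guarded: fall through to the later actions
        return action
    return "NA"
```

Under these assumed patterns, "teardown interior" skips DEMOLITION because of the guard and lands on ALTERATION, while "demolish detached garage" has no guard word and stays DEMOLITION.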

Signal-tier scoring

Greedy substring matching was the root cause of every precision bug in both QA audits, so the classifier weights a keyword by where it appears. A field's provenance sets its tier and its score. A type joins permit_scope when its accumulated signal score ≥ 3, or when it is the primary classification.

| Tier | Fields | Originator | Score | Trust |
| --- | --- | --- | --- | --- |
| Tier 1 | TYPE, SUBTYPE | Building Permit (AHJ categorical label) | 6 | Authoritative: any keyword match classifies |
| Tier 2 | DESCRIPTION, PROJECT_NAME | Building Permit (owner source text) | 4 | Primary text: specific compound keywords only; outranks Tier 3 |
| Tier 3 | PROJECT_TYPE array | BuildZoom-derived classification | 3 | Supporting: classifies only when Tiers 1-2 are silent; loses to a Tier-2 compound |
| Dropped | BUSINESS_NAME | Contractor name | | Not used for classification (contractor names hijacked rows). Kept as a feature. |

Scoring must satisfy: TYPE/SUBTYPE > DESCRIPTION compound > PROJECT_TYPE token > generic fallback. Location-prone types (GARAGE, SHED, DECK) fire only on a Tier-1 match or a strong Tier-2 compound — never on a bare "garage" mention in DESCRIPTION. System types (HVAC, PLUMBING) may match broadly in Tier 2 because their keywords are unambiguous.
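The tier weights and the location-prone rule can be sketched as a small scoring function. The scores and the threshold of 3 come from the table above; the function name `scope_types`, the `(field, type, is_compound)` hit shape and the way the location-prone rule is encoded are assumptions. The primary-classification path and the "Tier 3 only when Tiers 1-2 are silent" rule are deliberately left out.

```python
from collections import defaultdict

# Field scores from the tier table; BUSINESS_NAME is absent because it is dropped.
FIELD_SCORE = {
    "TYPE": 6, "SUBTYPE": 6,              # Tier 1
    "DESCRIPTION": 4, "PROJECT_NAME": 4,  # Tier 2
    "PROJECT_TYPE": 3,                    # Tier 3
}
SCOPE_THRESHOLD = 3
LOCATION_PRONE = {"GARAGE", "SHED", "DECK"}


def scope_types(hits):
    """hits: iterable of (field, permit_type, is_compound) keyword matches."""
    scores = defaultdict(int)
    for field, permit_type, is_compound in hits:
        score = FIELD_SCORE.get(field)
        if score is None:
            continue  # dropped field, never classifies
        if permit_type in LOCATION_PRONE:
            strong = score == 6 or (score == 4 and is_compound)
            if not strong:
                continue  # a bare "garage" mention is only a location
        scores[permit_type] += score
    return {t for t, s in scores.items() if s >= SCOPE_THRESHOLD}
```

For a description such as "water heater in garage", a bare GARAGE hit in DESCRIPTION is discarded while the PLUMBING hit scores 4 and joins the scope.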

The v5.0.1 re-rank — why PROJECT_TYPE was demoted Tier 1 → Tier 3

1. v5.0.0 trusted PROJECT_TYPE as authoritative. It was placed in "Tier 1" alongside TYPE/SUBTYPE, ranked above the DESCRIPTION source text.
2. But the BuildZoom dictionary shows it is a derived field. PROJECT_TYPE has Originator = BuildZoom (a derived classification), while TYPE/SUBTYPE/DESCRIPTION have Originator = Building Permit (primary source text). A derived field must not outrank the source text it was derived from.
3. It silently relabelled ~200K real re-roofs co-tagged HVAC. Re-roofs co-tagged HVAC by BuildZoom's PROJECT_TYPE lost their ROOFING classification, because PROJECT_TYPE's HVAC token outranked the DESCRIPTION reroof compound. v5.0.1 demoted PROJECT_TYPE to Tier 3 (score 3): it now classifies only when Tiers 1-2 are silent, and always loses to a DESCRIPTION compound.
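The effect of the demotion can be reproduced with two weight tables and a best-score pick. This is a toy model: the v5.0.0 score of 6 for PROJECT_TYPE is an assumption (the page only says it sat in Tier 1), and `primary` is a hypothetical helper, not the shipped selection logic.

```python
# Assumed v5.0.0 weights: PROJECT_TYPE shared the Tier-1 score.
V500_SCORE = {"TYPE": 6, "SUBTYPE": 6, "PROJECT_TYPE": 6, "DESCRIPTION": 4}
# v5.0.1 weights from the tier table: PROJECT_TYPE demoted to Tier 3.
V501_SCORE = {"TYPE": 6, "SUBTYPE": 6, "DESCRIPTION": 4, "PROJECT_TYPE": 3}


def primary(hits, field_score):
    """Pick the type whose originating field carries the highest score."""
    _field, permit_type = max(hits, key=lambda hit: field_score[hit[0]])
    return permit_type


# A re-roof whose DESCRIPTION says "reroof" but whose PROJECT_TYPE says HVAC.
cotagged_reroof = [("DESCRIPTION", "ROOFING"), ("PROJECT_TYPE", "HVAC")]
```

With `V500_SCORE` the derived HVAC token wins; with `V501_SCORE` the DESCRIPTION compound wins and the permit keeps ROOFING as its primary type.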

ROOFING — anti-false-positive guards

ROOFING is the highest-stakes type and the most error-prone, because roof is a substring of many unrelated words and a roof is the location of much non-roofing work. ROOFING therefore uses compound-first keywords (re-roof, roof replacement, shingle, metal roofing, tile roof, tpo, built-up roof) and is blocked by anti-FP guards unless a genuine-reroof compound fires the STRONG_ROOF rescue.

Blocked by anti-FP

  • solar / PV on roof
  • patio cover
  • screen room / porch
  • pool cage / enclosure
  • sunroom / lanai
  • carport
  • pergola
  • -proofing words: waterproofing, weatherproofing, soundproofing, fireproofing, dampproofing, rustproofing
  • roof deck / roof truss / roof drain

STRONG_ROOF rescue

  • A genuine reroof compound (e.g. tear off & re-roof, roof replacement) overrides the anti-FP block.
  • So "tear off shingles, re-roof, and install solar PV" yields both ROOFING:REPLACEMENT and SOLAR:NEW — one permit, two real items.

Bare " roof " location guard (v5.3.3)

  • Bare space-padded " roof " is suppressed when an equipment-location phrase (ROOF_LOCATION: "on roof", "rooftop", "roof mounted", "roof area") is the only context.
  • Kills "change out 3-ton AC on roof" and "rooftop packaged unit" false positives, which previously added a spurious ROOFING:NA item.
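The three guard layers above (anti-FP block, STRONG_ROOF rescue, bare " roof " location guard) can be combined in one predicate. The sketch below is an approximation: the patterns are assembled from the phrases listed on this page, the function name `has_roofing_signal` is hypothetical, and "the only context" is interpreted as "no bare roof mention remains once location phrases are removed".

```python
import re

STRONG_ROOF = re.compile(r"\bre-?roof|roof replacement|tear[- ]?off")
ROOF_COMPOUND = re.compile(r"shingle|metal roofing|tile roof|\btpo\b|built-up roof")
ANTI_FP = re.compile(
    r"solar|\bpv\b|patio cover|screen room|porch|pool cage|pool enclosure|"
    r"sunroom|lanai|carport|pergola|\w+proofing|roof deck|roof truss|roof drain"
)
ROOF_LOCATION = re.compile(r"\bon roof\b|rooftop|roof[- ]mounted|\broof area\b")


def has_roofing_signal(description: str) -> bool:
    text = f" {description.lower()} "  # pad so a bare " roof " can match at the ends
    if STRONG_ROOF.search(text):
        return True   # genuine reroof compound overrides every block
    if ANTI_FP.search(text):
        return False  # roof is only where the non-roofing work happens
    if ROOF_COMPOUND.search(text):
        return True
    # Location guard: drop equipment-location phrases, then look for a bare roof.
    remaining = ROOF_LOCATION.sub(" ", text)
    return " roof " in remaining
```

Under these assumed patterns, "change out 3-ton AC on roof" yields no ROOFING signal, while "tear off shingles, re-roof, and install solar PV" is rescued despite the solar keyword.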

Boilerplate stripping (v5.3.2)

Some jurisdictions append a long standard inspection checklist to every permit DESCRIPTION ("separate permits are required for electrical, plumbing, heating…", "footing and foundation inspection", "Gopher State One Call prior to digging"). That boilerplate names every trade by design, so the classifier scored it and stuffed the scope with phantom items that were never the permit's actual work. v5.3.2 truncates DESCRIPTION before the earliest of 8 boilerplate markers (BOILERPLATE_MARKERS) — only the real work-description prefix is scored.
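Truncating before the earliest marker can be sketched in a few lines. The three markers below are taken from the checklist phrases quoted above; the shipped BOILERPLATE_MARKERS constant has 8 entries that this page does not list, so this tuple is an illustrative subset and `strip_boilerplate` is a hypothetical name.

```python
# Illustrative subset only; the shipped constant has 8 markers.
BOILERPLATE_MARKERS = (
    "separate permits are required",
    "footing and foundation inspection",
    "gopher state one call",
)


def strip_boilerplate(description: str) -> str:
    """Keep only the work-description prefix before the first marker."""
    lowered = description.lower()  # assumes lowercasing keeps string length
    found = [lowered.find(marker) for marker in BOILERPLATE_MARKERS]
    positions = [pos for pos in found if pos != -1]
    if not positions:
        return description  # no boilerplate signature, score the full text
    return description[:min(positions)].rstrip()
```

Taking the minimum position matters: a checklist usually contains several markers, and cutting at a later one would leave earlier trade names in the scored text.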

Minneapolis (FIPS 27053) example. 100,223 permits carry the boilerplate signature — 99,660 of them in Minneapolis, essentially a single-jurisdiction artifact. Before the fix, boilerplate rows averaged 3.84 scope items vs 1.22 for normal permits (≈ 2.6 phantom items each). After v5.3.2, boilerplate-row average dropped to 2.16; the residual ≈ 1 phantom item is PROJECT_TYPE Tier-3 leakage (Phase 2), not text.

Output contract recap

The classifier (scripts/roofing/classify_v5.py) writes one row per Gold permit. The only taxonomy column is:

permit_scope   list<struct<type, action>>   — every (object, verb) the permit covers

type ∈ the 20 object values, action ∈ the 7 action values. The pairing is intrinsic (one struct), so type and action can never desync. permit_scope is never empty: a scopeless permit is [{OTHER, NA}] or [{UNKNOWN, NA}]. No derived columns (is_roofing, roof_action, a primary permit_type) are baked into the taxonomy — they are non-MECE projections the consumer derives:

  • is_roofing = 'ROOFING' is among the permit_scope types
  • roof_action = the action of the ROOFING item
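On the consumer side, the two projections can be derived as below. This is a sketch that assumes each scope item is a dict with `type` and `action` keys and that a scope holds at most one ROOFING item; neither assumption is confirmed by this page.

```python
def is_roofing(permit_scope):
    """True when ROOFING is among the permit_scope types."""
    return any(item["type"] == "ROOFING" for item in permit_scope)


def roof_action(permit_scope):
    """Action of the ROOFING item, or None when there is none."""
    for item in permit_scope:
        if item["type"] == "ROOFING":
            return item["action"]
    return None


scope = [
    {"type": "ROOFING", "action": "REPLACEMENT"},
    {"type": "SOLAR", "action": "NEW"},
]
```

Keeping these as functions over permit_scope, instead of stored columns, means they cannot drift out of sync with the taxonomy column.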

Positive-eligibility rules (the label contract)

| ROOFING item | Treatment | Why |
| --- | --- | --- |
| ROOFING:REPLACEMENT | Positive | Confirmed reroof |
| ROOFING:NA | Positive-eligible | The "ambiguous reroof" bucket: a roof permit defaults to a replacement absent a contrary verb. ~53% of ROOFING items carry NA; dropping them would gut >50% of roof positives. |
| ROOFING:REPAIR | NOT a positive | Patch, not replacement; carried only as a prior-repair feature |
| ROOFING:NEW + BUILDING:NEW | Excluded | New-construction roof, not a reroof event |
| ROOFING:{ADDITION, ALTERATION, DEMOLITION} | Excluded | Not a reroof event |
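The label contract can be written as a single lookup over a permit's scope. This sketch assumes dict-shaped scope items; the return strings and the name `roofing_treatment` are hypothetical. The table does not say how ROOFING:NEW is treated without a BUILDING:NEW item, so that case is returned as "unspecified" instead of being guessed.

```python
def roofing_treatment(permit_scope):
    """Map a permit_scope to the treatment of its ROOFING item."""
    pairs = {(item["type"], item["action"]) for item in permit_scope}
    roof_actions = {action for kind, action in pairs if kind == "ROOFING"}
    if not roof_actions:
        return "no_roofing_item"
    if "REPLACEMENT" in roof_actions:
        return "positive"            # confirmed reroof
    if "NA" in roof_actions:
        return "positive_eligible"   # ambiguous-reroof bucket
    if "REPAIR" in roof_actions:
        return "not_positive"        # kept only as a prior-repair feature
    if "NEW" in roof_actions:
        # Excluded only when paired with a new building, per the table.
        return "excluded" if ("BUILDING", "NEW") in pairs else "unspecified"
    return "excluded"                # ADDITION, ALTERATION, DEMOLITION
```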

The full schema, join semantics and consumer contract live on the parent step: Step 1 — Labeling.

Audit headline numbers

Audited 2026-05-21 · v5.3.3 · 31-FIPS local layer · 98,946,381 permits

  • 96.3%: is_roofing recall vs an independent BuildZoom reference (30-FIPS dev layer; an optimistic upper bound)
  • ≈98.5%: is_roofing precision (prior audits)
  • 5.985% → 5.743%: national ROOFING-in-scope rate, v4.0.1 → v5.3.3 (the guard shed ~144K incidental-location FPs)
  • 0: v5.3.3-caused false negatives in the FN stratum; every dropped v4-ROOFING was a v4 false positive

MECE check passed: 98,946,381 rows (== raw Gold), 0 empty/null scope, every type in the 20-enum and action in the 7-enum; 16.6% of rows are multi-item.
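A check of this kind can be sketched as a single pass over the scopes. The enum values are the ones listed on this page; the function `check_scopes` and the dict-shaped items are assumptions, and the audit's actual implementation is not shown here.

```python
PERMIT_TYPES = set(
    "SOLAR POOL FIRE GENERATOR HVAC ELECTRICAL PLUMBING ROOFING WINDOWS DOORS "
    "GARAGE SHED DECK FENCE SIGN FOUNDATION SITEWORK BUILDING OTHER UNKNOWN".split()
)
PERMIT_ACTIONS = set("DEMOLITION NEW REPLACEMENT REPAIR ADDITION ALTERATION NA".split())


def check_scopes(scopes):
    """Validate permit_scope lists and return the share with more than one item."""
    total = multi = 0
    for scope in scopes:
        if not scope:
            raise ValueError("permit_scope must never be empty or null")
        for item in scope:
            if item["type"] not in PERMIT_TYPES:
                raise ValueError(f"type outside the 20-enum: {item['type']}")
            if item["action"] not in PERMIT_ACTIONS:
                raise ValueError(f"action outside the 7-enum: {item['action']}")
        total += 1
        multi += len(scope) > 1
    return multi / total if total else 0.0
```

The returned fraction corresponds to the multi-item share reported in the audit line; comparing the row count against raw Gold would be a separate equality check on the caller's side.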

Version history

| Date | Version | Change | Impact |
| --- | --- | --- | --- |
| 2026-05-21 | v5.0.0 | Built + audited (2 reviewers, 365-row sample) → don't ship | Two-axis model live but PROJECT_TYPE over-trusted |
| 2026-05-21 | v5.0.1 | Signal-tier re-rank: PROJECT_TYPE demoted Tier 1 → Tier 3 | Stopped relabelling ~200K real re-roofs co-tagged HVAC |
| 2026-05-21 | v5.1.0 | Multi-label: emit all categories with signal + primary + is_roofing | Dissolves the single-label tie-break bug class (58% of permits are multi-trade) |
| 2026-05-21 | v5.3.0 | Collapse output to one MECE column permit_scope (intrinsic type↔action pairing) | Drops non-MECE derived columns; they become Step-5 projections |
| 2026-05-21 | v5.3.1 | ROOFING keyword expansion (single ply, sbs, cap sheet, standing seam, torch down…) | +11,169 ROOFING on 9-FIPS, 0 removed (monotone) |
| 2026-05-21 | v5.3.2 | AHJ-boilerplate stripping: truncate DESCRIPTION before the inspection checklist | Minneapolis boilerplate rows 3.84 → 2.16 avg scope items |
| 2026-05-21 | v5.3.3 | " roof " location guard: suppress incidental "AC on roof" mentions | Removed ~144K incidental-location FPs; 0 guard-caused FN |

Shipped vs dev. Shipped: v4.0.1 · dev: v5.3.3. v4.0.1 remains the currently shipped classifier; v5.3.3 is the audited current development state of Step 1.2 and awaits Ignacio/Eduardo sign-off before it replaces v4.0.1. Two follow-ups are logged, neither blocking documentation: PROJECT_TYPE Tier-3 leakage (the one remaining permit_scope over-tag, slated for Phase 2) and a small EPDM/membrane keyword gap (v5.4 candidate). The ROOFING:NA → positive-eligible default is documented as an assumption pending a human hand-grade of that bucket.

Rendered from notes/Roofing/steps/02_permit_taxonomy_v5.md + 05_labels_output_contract.md · audit notes/Roofing/audits/2026-05-21_v5.3.3_classifier_audit.md · classifier scripts/roofing/classify_v5.py (v5.3.3). This page is Step 1.2 of the roofing pipeline.