Roofing Pipeline · Rules Reference
Plain-language rule book for Step 1 (LABELING) & Step 2 (COVERAGE) · 2026-05-21
STEP2_RUNBOOK.md. Detail-level
specs live in steps/*.md; this is the at-a-glance map.
What goes in: a permit's four text fields — TYPE,
SUBTYPE / PROJECT_NAME, PROJECT_TYPE
(structured AHJ fields), and DESCRIPTION (free text).
What comes out: permit_scope — a list of
{type, action} items, one per (object, verb) the permit touches.
One permit can do several things, so the label is multi-label by design
(≈ 58 % of permits carry 2+ items).
Every item is one type × one action
| Axis | Values |
|---|---|
| permit_type (the object — 20) | ROOFING · SOLAR · HVAC · ELECTRICAL · PLUMBING · POOL · FIRE · GENERATOR · WINDOWS · DOORS · GARAGE · SHED · DECK · FENCE · FOUNDATION · SIGN · SITEWORK · BUILDING · OTHER · UNKNOWN |
| permit_action (the verb — 7) | NEW · REPLACEMENT · REPAIR · ADDITION · ALTERATION · DEMOLITION · NA |
is_roofing and roof_action are not
stored — they are derived downstream by asking "does permit_scope
contain a ROOFING item?".
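A minimal sketch of that downstream derivation, assuming permit_scope is held as a list of dicts with "type" and "action" keys. The function names mirror the derived fields, but the return shape and the first-match rule for permits with several ROOFING items are assumptions, not the pipeline's confirmed behavior.

```python
from typing import Optional


def is_roofing(permit_scope: list) -> bool:
    """True when any {type, action} item in the scope is a ROOFING item."""
    return any(item["type"] == "ROOFING" for item in permit_scope)


def roof_action(permit_scope: list) -> Optional[str]:
    """Action of the first ROOFING item, or None when the permit has none.

    Taking the first item is an assumption made for this sketch.
    """
    for item in permit_scope:
        if item["type"] == "ROOFING":
            return item["action"]
    return None


scope = [
    {"type": "ROOFING", "action": "REPLACEMENT"},
    {"type": "SOLAR", "action": "NEW"},
]
print(is_roofing(scope), roof_action(scope))
```

Because both values are computed on read, relabeling a permit never leaves a stale flag behind.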
Keyword cascade — structured fields are trusted more than free text
│
▼ Tier 1 — match keywords in TYPE / SUBTYPE / PROJECT_NAME / PROJECT_TYPE
│ (structured AHJ fields — high trust; broad keywords like "roof", "mechanical" allowed here)
▼ Tier 2 — match keywords in DESCRIPTION (free text)
│ (only specific / unambiguous keywords trusted here — broad ones false-positive)
▼
permit_scope = [ {type, action}, … ]
Each permit_type has its own keyword set. Specific
keywords (air conditioning, swimming pool) are
trusted in both tiers. Broad keywords (mechanical,
electrical) are trusted only in Tier 1 — in a free-text
DESCRIPTION they catch too much.
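The two-tier trust rule can be sketched as follows. The keyword sets here hold only the four examples named above and are hypothetical stand-ins for the real per-type sets; the function name match_types is also an assumption, and action detection is left out.

```python
import re

# Illustrative keyword sets only; the real per-type sets are not shown here.
SPECIFIC = {"HVAC": ["air conditioning"], "POOL": ["swimming pool"]}
BROAD = {"HVAC": ["mechanical"], "ELECTRICAL": ["electrical"]}


def _hits(text: str, keywords: list) -> bool:
    """Whole-word, case-insensitive keyword match."""
    text = text.lower()
    return any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)


def match_types(structured_fields: list, description: str) -> list:
    """Return permit types found by the cascade, in alphabetical order.

    Tier 1 (structured AHJ fields) trusts specific and broad keywords.
    Tier 2 (free-text DESCRIPTION) trusts specific keywords only.
    """
    structured = " ".join(f for f in structured_fields if f)
    found = []
    for ptype in sorted(set(SPECIFIC) | set(BROAD)):
        specific = SPECIFIC.get(ptype, [])
        broad = BROAD.get(ptype, [])
        tier1 = _hits(structured, specific + broad)
        tier2 = _hits(description or "", specific)
        if tier1 or tier2:
            found.append(ptype)
    return found


print(match_types(["Mechanical"], "replace unit"))
print(match_types(["Residential"], "door to mechanical room"))
```

In the second call "mechanical" appears only in the free text, so no HVAC item is produced.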
Three guards keep ROOFING honest
(1) Strong-roof compounds (re-roof, reroof, shingle, tear off, metal roofing, tile roof, built-up roof, modified bitumen, TPO roof, standing seam…): the permit IS a roof job, even if boilerplate also mentions something else.
(2) Phrases that do not count as a ROOFING item: waterproofing (contains "roofing"), roof deck, roof drain, roof truss, patio roof, roof over patio, tent.
(3) When the only match is the bare word roof and the text says the roof is just a place (on roof, rooftop, roof mounted), the ROOFING item is SUPPRESSED ("AC on roof" is HVAC, not roofing).
| Permit DESCRIPTION | permit_scope | why |
|---|---|---|
| "Tear off & re-roof, class-A shingles" | [{ROOFING, REPLACEMENT}] | strong-roof compound "re-roof" |
| "Install roof-mounted PV system 6.2 kW" | [{SOLAR, NEW}] | "roof" is only a location → guard 3 suppresses ROOFING |
| "Tear off shingles, re-roof, install solar PV" | [{ROOFING, REPLACEMENT}, {SOLAR, NEW}] | one permit, two real trades |
| "Repair roof leak" | [{ROOFING, REPAIR}] | roof + repair verb |
| "New single-family residence" | [{BUILDING, NEW}] | a new house — the roof is not a re-roof event |
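One way the ROOFING guards could be ordered is sketched below, using only the phrases listed in this section. The function name, the order of checks, and the hyphen normalization are assumptions for illustration. It answers only "should a ROOFING item exist?" and does not assign an action.

```python
import re

STRONG = ["re-roof", "reroof", "shingle", "tear off", "metal roofing", "tile roof",
          "built-up roof", "modified bitumen", "tpo roof", "standing seam"]
NOT_A_ROOF_JOB = ["waterproofing", "roof deck", "roof drain", "roof truss",
                  "patio roof", "roof over patio", "tent"]
ROOF_AS_PLACE = ["on roof", "rooftop", "roof mounted"]


def _alternation(phrases):
    """Compile a whole-word alternation over the given phrases."""
    return re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")


NOT_A_ROOF_JOB_RE = _alternation(NOT_A_ROOF_JOB)
ROOF_AS_PLACE_RE = _alternation(ROOF_AS_PLACE)


def has_roofing_item(description: str) -> bool:
    text = description.lower()
    # Guard 1: a strong-roof compound wins outright.
    if any(k in text for k in STRONG):
        return True
    # Assumption: treat "roof-mounted" like "roof mounted".
    text = text.replace("-", " ")
    # Guard 2: blank out phrases that contain "roof" but are not roof jobs.
    text = NOT_A_ROOF_JOB_RE.sub(" ", text)
    # Guard 3: the roof is only a location, so suppress the ROOFING item.
    if ROOF_AS_PLACE_RE.search(text):
        return False
    # Otherwise any remaining "roof" substring counts.
    return "roof" in text


for d in ["Tear off & re-roof, class-A shingles",
          "Install roof-mounted PV system 6.2 kW",
          "Repair roof leak",
          "New single-family residence"]:
    print(d, "->", has_roofing_item(d))
```

Checking guard 1 first is what lets a permit that both re-roofs and mentions rooftop equipment keep its ROOFING item.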
PROJECT_TYPE field — ≈ 198 K permits on the
30-FIPS dev layer that PROJECT_TYPE calls roofing and the v5
keyword cascade missed. The old "≈ 99 %" figure was circular (it compared v5
to the previous classifier, not to an independent reference). Detail:
independent recall check.
Step 2 produces, for every (FIPS, FA-municipality, jurisdiction)
tuple, a coverage_decision ∈ {INCLUDE, FLAG, EXCLUDE}. Only
INCLUDE geographies feed model training and the
lead list. The whole step is built on one idea: two messy
"municipality" vocabularies have to be cleaned into one shared key before
they can be compared.
First American's Municipality is
a legal-jurisdiction field full of district codes; BuildZoom names a
STATE_County_City jurisdiction. Step 2 cleans both
sides into the same canonical key — (state, county_fips,
canonical_place, place_type) — so they can be matched. That is steps
2A → 2B → 2C below. 2D is the decision tree that runs on the matched result.
First American Municipality → canonical place
FA's Municipality is the legal jurisdiction, not the
city. Each distinct value is sorted into one of 4 status
buckets. Only city_named carries a city
— the other three set city = NULL on purpose.
- city_named: "COCONUT CREEK" · "EAST EARL TWP" → EAST EARL
- unincorporated
- district_code
- unknown
Distribution (non-NULL strings):
city_named 89.3 % · unincorporated 7.7 % · district_code 2.0 % · unknown
1.0 %. A NULL value is not assumed unincorporated — FA has a separate
explicit "UNINCORPORATED" value, so NULL stays unknown.
Classifier: scripts/roofing/classify_fa_municipality.py · 72.76 M
parcels · audited 4 rounds × 3 agents (1,800 graded cases, ~99 % non-LA
accuracy post-2026-05-21 patches).
Audit-confirmed edge cases — what the classifier catches
| FA Municipality string | → bucket | canonical_place | rule |
|---|---|---|---|
| USD 260 EPCD RODY | district_code | — | USD + digit = school district |
| SD#9 CITY OF SEDONA/FD SEDONA | district_code | — | SD# stub overrides embedded city name |
| WHITNEY ARTESIAN BASIN | district_code | — | bare-word BASIN (R1 patch) |
| DENHAM ACRES LIGHTING | district_code | — | bare-word LIGHTING (R1 patch) |
| DOWNTOWN IMPROVEMENT DISTRICT | district_code | — | bare IMPROVEMENT DIST(RICT) (R1 patch) |
| HUNTER MILL RESTON SERVICE | district_code | — | trailing SERVICE (R1 patch) |
| GENERAL SERVICES | district_code | — | plural SERVICES (R2 patch) |
| BRADDOCK TRANSPORTATION | district_code | — | TRANSPORTATION token (R2 patch) |
| HUNTER MILL TYSONS SERVICE DIS | district_code | — | SERVICE DIS(T)? truncated (R2 patch) |
| MT VERNON DIST. #1 | district_code | — | DIST. period tolerance (R3 patch) |
| 96-RT | district_code | — | digit-prefix stub (R1 patch) |
| UN-INCORPORATED | unincorporated | — | hyphen-tolerant UN-?INCORP (R3 patch) |
| SALT LAKE COUNTY | unincorporated | — | state-scoped real-county lookup |
| CITY OF PISMO BEACH | city_named | PISMO BEACH | prefix strip |
| EAST EARL TWP | city_named | EAST EARL | suffix strip |
| FORT LEE BORO | city_named | FORT LEE | NJ BORO suffix (R2 patch) |
| AURORA (TOV) | city_named | AURORA | parenthetical (TOV) strip (R1 patch) |
| BROOKLYN HTS. | city_named | BROOKLYN HEIGHTS | HTS→HEIGHTS + period strip (R3 patch) |
(1) Los Angeles strings such as SOUTH /COMPTON-N/W, HALL OF ADMIN/... stay
in city_named by design. No decision impact: all fall to
city_under_county → CA_Los Angeles →
same coverage decision. Deferred as an upstream cosmetic fix.
(2) Bare-township names without district tokens
(NEW TRIER, PROVISO) need a township gazetteer to
catch — out of scope for the regex classifier.
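The edge-case table can be read as an ordered set of regex rules. The sketch below is a hypothetical reduction of those rows, not the contents of classify_fa_municipality.py: the pattern list, the rule order, the one-entry REAL_COUNTIES lookup, and the choice to map only NULL or blank input to unknown are all assumptions.

```python
import re

# District tokens taken from the edge-case table; the list is illustrative, not complete.
DISTRICT_PATTERNS = [
    r"\bUSD\s*\d",                      # USD + digit = school district
    r"\bSD#",                           # SD# stub overrides an embedded city name
    r"^\d+-",                           # digit-prefix stub such as 96-RT
    r"\bBASIN\b",
    r"\bLIGHTING\b",
    r"\bIMPROVEMENT DIST(?:RICT)?\b",
    r"\bSERVICES?$",                    # trailing SERVICE or plural SERVICES
    r"\bSERVICE DIST?$",                # truncated SERVICE DIS / SERVICE DIST
    r"\bTRANSPORTATION\b",
    r"\bDIST\.",                        # DIST. with a period
]

# Hypothetical stand-in for the state-scoped real-county lookup.
REAL_COUNTIES = {("UT", "SALT LAKE")}


def classify(value, state=None):
    """Return (bucket, canonical_place) for one FA Municipality string."""
    if value is None or not value.strip():
        return ("unknown", None)        # NULL is never assumed unincorporated
    s = re.sub(r"\s+", " ", value.strip().upper())
    if any(re.search(p, s) for p in DISTRICT_PATTERNS):
        return ("district_code", None)
    if re.search(r"\bUN-?INCORP", s):
        return ("unincorporated", None)
    county = re.fullmatch(r"(.+) COUNTY", s)
    if county and (state, county.group(1)) in REAL_COUNTIES:
        return ("unincorporated", None)
    place = re.sub(r"\s*\([^)]*\)", "", s)           # drop parentheticals such as (TOV)
    place = re.sub(r"^CITY OF ", "", place)          # prefix strip
    place = re.sub(r"\bHTS\.?$", "HEIGHTS", place)   # HTS -> HEIGHTS, period dropped
    place = re.sub(r" (?:TWP|BORO)$", "", place)     # suffix strip
    return ("city_named", place.strip())


print(classify("SD#9 CITY OF SEDONA/FD SEDONA"))
print(classify("BROOKLYN HTS."))
```

District patterns run first so that a string like the Sedona example never reaches the city-name cleanup, which matches the "stub overrides embedded city name" rule in the table.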
BuildZoom jurisdiction → canonical key
BuildZoom names a jurisdiction as STATE_County_City or
STATE_County. The normalizer parses each of the ~2,500 distinct
strings into the same canonical shape the FA side emits, so
the two can be joined.
▼ parse · county → FIPS (Census 2024 gazetteer, 3,222 counties) · fold diacritics · resolve aliases
( state=FL , county_fips=12086 , canonical_place=MIAMI GARDENS , place_type=city )
County → FIPS resolves 99.7 %. Status: 80.0 %
resolved to a place · 19.7 % county-level · 0.3 % malformed. Normalizer:
scripts/roofing/normalize_bz_jurisdiction.py · audited 4 cycles ·
100 %.
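A minimal sketch of the parse step, assuming a two-row dictionary in place of the Census gazetteer and skipping alias resolution. The CanonicalKey type, the place_type labels other than city, and the rule for what counts as malformed are hypothetical, chosen to line up with the three status shares quoted above.

```python
import unicodedata
from typing import NamedTuple, Optional


class CanonicalKey(NamedTuple):
    state: str
    county_fips: Optional[str]
    canonical_place: Optional[str]
    place_type: str  # "city", "county", or "malformed" (labels assumed)


# Hypothetical stand-in for the county -> FIPS gazetteer lookup.
COUNTY_FIPS = {("FL", "MIAMI-DADE"): "12086", ("CA", "LOS ANGELES"): "06037"}


def fold(text: str) -> str:
    """Upper-case and strip diacritics (for example, n with tilde becomes N)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.upper().strip()


def normalize_bz(jurisdiction: str) -> CanonicalKey:
    """Parse STATE_County or STATE_County_City into the shared key shape."""
    parts = jurisdiction.split("_", 2)
    if len(parts) < 2 or len(parts[0]) != 2:
        return CanonicalKey(parts[0], None, None, "malformed")
    state = parts[0].upper()
    fips = COUNTY_FIPS.get((state, fold(parts[1])))
    if fips is None:
        return CanonicalKey(state, None, None, "malformed")
    if len(parts) == 2:
        return CanonicalKey(state, fips, None, "county")
    return CanonicalKey(state, fips, fold(parts[2]), "city")


print(normalize_bz("FL_Miami-Dade_Miami Gardens"))
print(normalize_bz("CA_Los Angeles"))
```

Splitting with a maximum of two separators keeps any underscore inside a city name attached to the city part.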
FA ↔ BZ on the shared canonical key
Both sides now speak (county_fips, canonical_place). Joining
them gives one of four outcomes per FA municipality:
| Outcome | Meaning | Example | SFH |
|---|---|---|---|
| city_matched | FA city ↔ a BZ city jurisdiction. Strongest link. | PINE → PA_Allegheny_Pine Township | 31.1 % |
| city_under_county | FA names a city BZ has no separate jurisdiction for, but BZ covers the county. Permit data exists at county grain. | SOUTH /COMPTON-N/W → CA_Los Angeles | 22.6 % |
| county_matched | FA unincorporated / district / unknown ↔ a BZ county jurisdiction. | UNINCORPORATED AREA → FL_Volusia | 12.9 % |
| no_bz | No BZ jurisdiction for that county at all → no permit coverage. | MUNHALL → — | 33.3 % |
66.7 % of FA single-family homes fall under
a BuildZoom jurisdiction (the first three rows). The last row feeds
the EXCLUDE gate below. Script:
scripts/roofing/match_fa_bz.py → match_table_v3.
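The four outcomes can be sketched as a lookup against two sets built from the BuildZoom side. The function name, the argument layout, and the set contents are assumptions for illustration and do not describe match_fa_bz.py. The sketch reads no_bz as "no matching city jurisdiction and no county-grain jurisdiction", which is how the MUNHALL example behaves.

```python
def match_outcome(fa_bucket, county_fips, canonical_place, bz_cities, bz_counties):
    """Classify one FA municipality against BuildZoom jurisdictions.

    bz_cities:   set of (county_fips, canonical_place) with a BZ city jurisdiction
    bz_counties: set of county_fips with a BZ county-grain jurisdiction
    """
    if fa_bucket == "city_named" and (county_fips, canonical_place) in bz_cities:
        return "city_matched"
    if county_fips in bz_counties:
        # A named city without its own BZ jurisdiction falls back to county grain.
        return "city_under_county" if fa_bucket == "city_named" else "county_matched"
    return "no_bz"


# Illustrative BZ-side sets (assumes aliases already resolved Pine Township to PINE).
bz_cities = {("42003", "PINE")}
bz_counties = {"06037"}

print(match_outcome("city_named", "42003", "PINE", bz_cities, bz_counties))
print(match_outcome("city_named", "06037", "SOUTH /COMPTON-N/W", bz_cities, bz_counties))
print(match_outcome("city_named", "42003", "MUNHALL", bz_cities, bz_counties))
```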
Coverage decision tree — 4 gates, first fail decides
Provider coverage:
- None → EXCLUDE (vendor confirms no data)
- (empty) → EXCLUDE (year not ingested)
- Some → FLAG (partial)
- Yes ↓ continue
match_rate = SFH-with-a-roof-permit / total-SFH.
Gate 3 NEVER hard-EXCLUDEs; see the note below.
- ≥ 50 % → INCLUDE
- 25–50 % → FLAG (include with caveat)
- < 25 % → FLAG (manual review)
match_rate is a measurement. A low
value cannot tell apart (a) "BuildZoom genuinely has no permits here"
from (b) "BuildZoom has them, but our address join failed to
connect permit ↔ parcel". Example: Phoenix — 368 K homes, provider says
Yes, 25 years of history — scores 4.3 %, almost certainly a
broken join. So a low match rate is a FLAG for review, never
a confirmed absence. Only the vendor-authoritative gates 1–2 hard-EXCLUDE.
coverage_recovery_queue.csv — geographies whose own
measured evidence contradicts a conservative decision (e.g. provider says
None but we measure ≥ 25 %). Current run: 274
jurisdictions / 9.17 M SFH queued for re-review.
Worked examples
| FIPS · FA muni | jurisdiction | provider | match_rate | → decision |
|---|---|---|---|---|
| 12086 · MIAMI GARDENS | FL_Miami-Dade_Miami Gardens | Yes | 67.9 % | INCLUDE gate 3 ≥ 50 % |
| 06037 · SOUTH /COMPTON-N/W | CA_Los Angeles | Yes | 36.3 % | FLAG gate 3 25–50 % |
| 13255 · SUNNYSIDE | GA_Spalding | None | — | EXCLUDE gate 2 |
| 42003 · MUNHALL | NOBZ_42003 (none) | — | — | EXCLUDE gate 1 |
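The recoverable gates and the worked examples above can be sketched as a first-fail function. It covers gates 1 to 3 only. The function name, the argument names, and the handling of a missing match_rate are assumptions, and filing the (empty) and Some branches under gate 2 is this sketch's reading rather than a confirmed rule.

```python
from typing import Optional, Tuple


def coverage_decision(has_bz_jurisdiction: bool,
                      provider: Optional[str],
                      match_rate: Optional[float]) -> Tuple[str, str]:
    """Return (decision, reason); match_rate is a percentage such as 67.9."""
    # Gate 1: no BuildZoom jurisdiction means no permit coverage.
    if not has_bz_jurisdiction:
        return ("EXCLUDE", "gate 1: no BuildZoom jurisdiction")
    # Gate 2: vendor-reported coverage.
    if provider == "None":
        return ("EXCLUDE", "gate 2: vendor confirms no data")
    if provider is None or provider == "":
        return ("EXCLUDE", "gate 2: year not ingested")
    if provider == "Some":
        return ("FLAG", "gate 2: partial")
    if provider != "Yes":
        raise ValueError(f"unexpected provider value: {provider!r}")
    # Gate 3: measured match rate; this gate never hard-EXCLUDEs.
    if match_rate is None:
        # Assumption: an unmeasured geography goes to manual review.
        return ("FLAG", "gate 3: no measurement, manual review")
    if match_rate >= 50:
        return ("INCLUDE", "gate 3: match_rate >= 50 %")
    if match_rate >= 25:
        return ("FLAG", "gate 3: 25-50 %, include with caveat")
    return ("FLAG", "gate 3: < 25 %, manual review")


print(coverage_decision(True, "Yes", 67.9))
print(coverage_decision(True, "Yes", 4.3))
```

The Phoenix case (provider Yes, 4.3 %) comes back as FLAG rather than EXCLUDE, which is the behavior the note on broken joins asks for.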
Current national run: 32,179 tuples — 1,006 INCLUDE · 9,424 FLAG · 21,749 EXCLUDE (9.6 % / 46.2 % / 44.2 % of SFH). 1,420 of 1,421 FIPS — Dallas 48113 is excluded (a broken FA silver file, needs an AWS re-pull).
audits/2026-05-21_step2_3layer_audit_summary.md.
Action Plan System
Outbound list cadence, scoring flow, and exclusion logic · appended from ACTION_PLAN_SYSTEM.html
Roofing outbound list cadence + exclusions · v1 (2026-05-25)
Long spec: notes/findings/79_action_plan_roofing_cadence.md. This page is the visual + monthly-process reference.
End-to-end flow
Why this system exists
Before v1, the CallZeke 15K list shipped with 36 non-SFH rows, 90 blank mailing addresses, 82 zero-EMV rows, and an Action plan = "30 days" hardcoded on every single row. The bigger problem: monthly recurring lists with no memory of prior touches burn the audience inside one quarter.
Layers at a glance
| Layer | What it does | Status |
|---|---|---|
| R1-R4 exclusions | SFH-only · mailable · EMV>0 · USPS DPV ∈ {Y,S,D}. Drop before tier assignment. | shipped |
| Tier mapping | Per-county percentile → High top 10% / Medium p60-p90 / Low < p60. | shipped |
| Score band display | Client-facing score is rank-scaled inside each tier into a non-overlapping band: High 85-100, Medium 55-84, Low 30-54. Top of list = 100; bottom of kept set = 30 (raw probability is preserved separately for internal use). Presents the score as a rank inside the top 3% rather than an absolute probability, which avoids the "0.4 / 1.0" perception that the bottom of the list is garbage. | shipped 2026-05-26 (v17j) |
| Cadence schedule | High 3 DM + 2 CC over 12wk · Medium 3 DM · Low 2 DM. SMS deferred. | shipped |
| Owner CC cap | Max 1 cold call per owner across the list (DM stays per-mailbox). | shipped |
| Touch ledger | One row per (property × touch event) · Wayfind feed · cross-list memory. | designed · build next |
| Eligibility filter | Cooldown + cap rules using ledger · freshness decay on effective score. | future |
| Uplift model | P[respond to contact] trained on response data · multiplies propensity. | future |
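The tier mapping and score band rows can be illustrated with a small sketch for a single county. Linear interpolation by rank inside each band is an assumption, as are the function names; the shipped scaling may differ. Integer comparisons are used for the percentile cutoffs to avoid floating-point edge cases at exactly p60 and p90.

```python
BANDS = {"High": (85, 100), "Medium": (55, 84), "Low": (30, 54)}


def tier_for(rank_pos: int, n: int) -> str:
    """rank_pos is 1 (lowest raw score) to n (highest) within one county."""
    if rank_pos * 10 > 9 * n:      # above p90: top 10 %
        return "High"
    if rank_pos * 10 > 6 * n:      # p60 to p90
        return "Medium"
    return "Low"                   # p60 and below


def display_scores(raw_scores: list) -> list:
    """Return (tier, display_score) per input row, in the input order."""
    n = len(raw_scores)
    ascending = sorted(range(n), key=lambda i: raw_scores[i])
    tiers = [None] * n
    for pos, i in enumerate(ascending, start=1):
        tiers[i] = tier_for(pos, n)
    out = [None] * n
    for tier, (lo, hi) in BANDS.items():
        members = [i for i in ascending if tiers[i] == tier]
        for pos, i in enumerate(members):
            # Worst row in the tier gets lo, best gets hi (a lone row gets hi).
            frac = pos / (len(members) - 1) if len(members) > 1 else 1.0
            out[i] = (tier, round(lo + frac * (hi - lo)))
    return out


scores = [0.01 * k for k in range(1, 11)]
for raw, shown in zip(scores, display_scores(scores)):
    print(f"{raw:.2f} -> {shown}")
```

With ten rows this yields one High, three Medium, and six Low, the same 10 / 30 / 60 split as the monthly list.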
Touch volume + cost per monthly ship
| Tier | Properties | DM (12wk) | CC (12wk) | Est. cost (DM $0.50 · CC $5) |
|---|---|---|---|---|
| High (30d) | 1,503 | 4,509 | 2,978 | $17,145 |
| Medium (60d) | 4,498 | 13,494 | 0 | $6,747 |
| Low (90d) | 8,999 | 17,998 | 0 | $8,999 |
| Total | 15,000 | 36,001 | 2,978 | ~$33K |
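The cost column follows directly from the touch counts and the two unit costs in the header. A short check of that arithmetic (variable names are illustrative):

```python
DM_COST = 0.50   # dollars per direct-mail piece
CC_COST = 5.00   # dollars per cold call

# (DM touches, CC touches) over 12 weeks, copied from the table above.
touches = {"High": (4509, 2978), "Medium": (13494, 0), "Low": (17998, 0)}

costs = {tier: dm * DM_COST + cc * CC_COST for tier, (dm, cc) in touches.items()}
total = sum(costs.values())

for tier, cost in costs.items():
    print(f"{tier}: ${cost:,.2f}")
print(f"Total: ${total:,.2f}")
```

The High row comes to $17,144.50 before rounding, and the total of $32,890.50 is what the table reports as roughly $33K.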
Monthly maintenance + send process
The list ships monthly. The same household stays eligible for re-contact only if its cooldown has elapsed (Phase 2 enforces this; Phase 0 ships without ledger memory). A clean monthly cadence:
| Day | Action | Owner |
|---|---|---|
| D-7 | Pull fresh silver REM partition (data/sandbox/silver_rem/<month>/) + gold permit vintage. Refresh enrichment + scored parquet. | Ignacio (DS) |
| D-5 | Run scripts/roofing/build_15k_with_exclusions.py. Smarty cache amortizes — only new addresses incur lookups. | Ignacio |
| D-4 | Audit pass: compare per-tier score drift vs prior month · sample 50 rows for sanity · run scripts/roofing/audit_15k_client_list.py. | Ignacio |
| D-3 | Phase 1+: read touch ledger · exclude households still on cooldown · refill from scored pool · re-run quota slice. | Ignacio (Phase 1 build) |
| D-2 | Camilo review — spot-check 20 rows · sign off on tier mix. | Camilo |
| D-1 | Strip internal columns (owner_hash, list_id, smarty_*) → client-facing CSV (or keep all if Wayfind wants the metadata). | Ignacio |
| D-0 | Send: upload to Wayfind (S3 bucket or shared Google Drive — confirm channel w/ ops). Email Zeke + Camilo with summary stats + audit notes. | Ignacio + Camilo |
| D+1 → D+90 | Wayfind / Salesmate logs each touch back to touch_ledger.parquet (Phase 1). Response codes captured as DM bounces / CC outcomes / SMS replies come back. | Wayfind ops · Camilo |
| D+30 / D+60 / D+90 | Mid-cycle metrics review: response rate by tier · DM bounce rate · CC connect rate. Adjusts λ + tier cutoffs for next ship. | Ignacio + Eduardo |
Send mechanism (v1 · pending Wayfind confirmation)
- Upload the CSV to the Wayfind shared bucket (path TBD; Camilo to confirm). Tag with list_id = callzeke_<YYYY-MM-DD>.
- Email Zeke + Camilo with the file path, total row count, tier breakdown, expected touch volume, and link to CALLZEKE_ROOFING.html + this page.
- Slack notification to the #roofing channel with one-line summary (e.g., "callzeke 2026-06-30 list shipped · 15K rows · 1,503 High / 4,498 Med / 8,999 Low · expected lift 4.7×").
- Archive: keep prior months' CSVs under data/sandbox/model/callzeke_PROD_15k_<YYYY-MM-DD>_v<n>_full15k.csv. Symlink or alias latest.
Phase roadmap
| Phase | What | Status |
|---|---|---|
| 0 | R1-R4 exclusions + tier + cadence + owner cap (the page above). | shipped 2026-05-25 |
| 1 | touch_ledger.parquet + Wayfind ingestion. Build before second monthly ship. | designed |
| 2 | Eligibility filter at silver-rebuild time + freshness decay on effective score. | future |
| 3 | Uplift / receptivity model trained on response data · horizon retrain to match ship cadence. | future (≥ 6mo response data needed) |
References
- Long spec: notes/findings/79_action_plan_roofing_cadence.md (10 sections incl. why-not-alternatives, defaults table, open questions)
- Code: scripts/roofing/build_15k_with_exclusions.py · lib/pipeline_filters.py · lib/smarty_client.py
- Audits: notes/Roofing/audits/2026-05-25_callzeke_v4_audit_agent_*.md (5-agent v4 review) · 2026-05-25_callzeke_audit_agent{1,2,3}_*.md (3-agent pre-build audit)
- Canonical: Analytics/docs/BUSINESS_CONTEXT.md §Action Plans (Roofing extension)
- Notebook: PROGRESS_NOTEBOOK.html (closing panel 2026-05-25)