Feature taxonomy
How raw silver columns become model features — the variable buckets, the INCLUDE/REVIEW/EXCLUDE triage, and the synthetic-feature families that fold many raw columns into one business-readable signal.
Every model input traces back to the silver property schema. Before a column can become a feature it gets sorted into exactly one bucket and tagged with an action: INCLUDE (used), REVIEW (candidate, not yet wired), or EXCLUDE (dropped — identifier, free-text, redundant, or label-leakage). The synthetic-feature layer then collapses several raw columns into a single derived signal so the model sees signal, not redundancy.
Status — building. The REI roofing model is in active development. INCLUDE/REVIEW/EXCLUDE counts reflect the propose_action triage in notes/columns_reviewed.csv; REVIEW columns are candidates that may or may not be wired into a given model version. Three date-diff synthetics are under an active leakage audit (flagged below).
The variable taxonomy
The 476 silver columns are partitioned into 15 buckets. Each bucket holds one concept (identifiers, physical, owner, financial, distress, location…), and each column lands in exactly one bucket. Counts below are taken directly from the taxonomy summary table.
| Bucket | What it holds / provenance | N | INCLUDE | REVIEW | EXCLUDE |
|---|---|---|---|---|---|
| Hudi / pipeline metadata | Partition salting, dedupe keys, row timestamps, refresh date — pipeline plumbing, not signal | 4 | 0 | 1 | 3 |
| Core identifiers | APN, PropertyId, Period, FIPS — keys; only FIPS and OldApnIndicator kept | 8 | 2 | 0 | 6 |
| Current transaction label leakage | Buyer/seller names, sale price, recording dates of the current sale — describes the outcome, so all excluded | 27 | 0 | 0 | 27 |
| Distress signals | 23 distress types, each with an active flag; 19 of them also carry a date (preforeclosure, probate, tax-delinquent, divorce, eviction, liens…) | 42 | 39 | 0 | 3 |
| Property — physical | Year built, use, area, rooms, construction/condition/quality codes | 29 | 7 | 11 | 11 |
| Property — location & geography | Situs address parts, ZIP, census tract/block, municipality, subdivision | 24 | 4 | 6 | 14 |
| Owner identity free-text | Owner first/middle/last names, mailing address strings — PII free-text, all excluded | 11 | 0 | 0 | 11 |
| Owner — relationship with property | Absentee, owner-occupied, owner type, equity flag, ownership tenure | 12 | 3 | 2 | 7 |
| Financial — valuations | Assessor / market / AVM values, tax amount, FA confidence score | 13 | 1 | 6 | 6 |
| Financial — mortgage stack | Mtg1–Mtg4 + concurrent liens: balances, rates, terms, lender names, LTV/CLTV/CLBTV | 86 | 2 | 16 | 68 |
| Financial — listing / MLS activity | List price, listed flag, listing status | 5 | 0 | 0 | 5 |
| Prior transaction (PrevSale*) | Previous sale price, dates, prior-mortgage detail; prior to the current sale, so usable as-of-T0 | 39 | 2 | 15 | 22 |
| Provider-scored signals (FA) | First American propensity / liquid-assets / purchase-intel scores | 3 | 1 | 2 | 0 |
| Quality / validity flags | FA *Valid recording-detail flags, cash-buyer flag, last-PFC flag | 6 | 1 | 2 | 3 |
| Miscellaneous — unclassified | Listing-agent fields, attic/basement/garage sqft, exemption flags, HOA, PFC auction detail, legal description | 167 | 5 | 48 | 114 |
| Total | All 15 buckets | 476 | 67 | 109 | 300 |
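The table's two invariants (every column carries exactly one action tag, and the bucket counts add up to the totals row) can be checked mechanically. The sketch below hard-codes the counts from the table; the `BUCKETS` layout and the `check_partition` helper are illustrative names, not part of the pipeline.

```python
# Counts copied from the taxonomy table: (bucket, N, INCLUDE, REVIEW, EXCLUDE).
BUCKETS = [
    ("Hudi / pipeline metadata", 4, 0, 1, 3),
    ("Core identifiers", 8, 2, 0, 6),
    ("Current transaction label leakage", 27, 0, 0, 27),
    ("Distress signals", 42, 39, 0, 3),
    ("Property - physical", 29, 7, 11, 11),
    ("Property - location & geography", 24, 4, 6, 14),
    ("Owner identity free-text", 11, 0, 0, 11),
    ("Owner - relationship with property", 12, 3, 2, 7),
    ("Financial - valuations", 13, 1, 6, 6),
    ("Financial - mortgage stack", 86, 2, 16, 68),
    ("Financial - listing / MLS activity", 5, 0, 0, 5),
    ("Prior transaction (PrevSale*)", 39, 2, 15, 22),
    ("Provider-scored signals (FA)", 3, 1, 2, 0),
    ("Quality / validity flags", 6, 1, 2, 3),
    ("Miscellaneous - unclassified", 167, 5, 48, 114),
]


def check_partition(buckets):
    """Verify each row is internally consistent, then return the column totals."""
    for name, n, include, review, exclude in buckets:
        # One action tag per column, so the three counts must add up to N.
        assert include + review + exclude == n, f"{name}: counts do not sum to {n}"
    # Totals for N, INCLUDE, REVIEW, EXCLUDE in that order.
    return tuple(sum(row[i] for row in buckets) for i in range(1, 5))


print(len(BUCKETS), check_partition(BUCKETS))  # 15 (476, 67, 109, 300)
```

A check of this shape is cheap to rerun whenever notes/columns_reviewed.csv changes, since a column that is moved between buckets without updating the summary shows up as a failed row or a changed total.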
How the action tags work
| Tag | Meaning | Typical reason |
|---|---|---|
| INCLUDE | Used as a feature (raw or as input to a synthetic) | Carries as-of-T0 signal, low enough null rate to be useful |
| REVIEW | Candidate — not yet committed to a model version | Plausible signal but uncertain quality / coverage |
| EXCLUDE | Dropped from the model | Identifier, free-text PII, within-concept redundancy, or current-transaction label leakage |
Worked examples of the triage
| Column | Bucket | Action | Null | Why |
|---|---|---|---|---|
| FIPS | Core identifiers | INCLUDE | 0.00 | Five-digit county code — geography anchor |
| CurrentSalesPrice | Current transaction | EXCLUDE | 0.20 | Describes the outcome being predicted — label leakage |
| PreforeclosureDistress | Distress signals | INCLUDE | 1.00 | Lender filed NOD — sparse but high-signal |
| Owner1FirstName | Owner identity | EXCLUDE | 0.32 | Free-text PII, no model value |
| PropensityScore | Provider-scored (FA) | INCLUDE | 0.55 | FA's own propensity-to-sell — benchmark / feature |
| PrevSalesPrice | Prior transaction | INCLUDE | 0.46 | Prior sale — fully as-of-T0, feeds appreciation |
The null rates above come from the per-column tables in the source. See Gold variable inventory for the full column-level reference and Data sources for provenance of each feed.
Synthetic features
A synthetic feature replaces one or more raw columns with a single business-readable signal: redundancy removed, signal preserved, and each mapping documented as raw columns (inputs) → synthetic feature (output). Sixteen families are documented. Across them, roughly 55 raw columns are folded into about 25 synthetic signals. The catalog is authoritative; the code in src/new_model/features.py reflects it.
Owner, property & valuation synthetics
| Feature | Raw inputs | Formula | Meaning |
|---|---|---|---|
| absentee_level | Absentee, MailingState, SitusState | 0 if not absentee; 1 if absentee & in-state; 2 if absentee & out-of-state | 3-way absenteeism. Out-of-state (level 2) = highest sale propensity. Miami-Dade 2024-03 sample: 6.7% are level 2. |
| property_age_years | YearBuilt, T0 | T0_year − YearBuilt | Age, not static year. Miami-Dade 2024-03 sample: median 48 yrs, range 2–124. |
| leverage_ratio | CLBTV, CLTV, LTV | coalesce(CLBTV, CLTV, LTV) + leverage_source tag | One consolidated loan-to-value (0–100). Collapses 3 collinear columns; source tag preserves provenance. |
| lot_size_sqft | LotSizeSqFt (canonical) | passthrough; drops acres/depth/frontage/listing/FA-est variants | Parcel land size, one unit. |
| building_area_sqft | BuildingArea, SumBuildingSqFt | coalesce(BuildingArea, SumBuildingSqFt); drops 5 other sum-area variants | Total building footprint. |
| living_area_sqft | SumLivingAreaSqFt (canonical) | passthrough; drops listing-derived variants | Interior livable area (excl. garage/basement/unfinished). |
| valuation_gap | MarketTotalValue, AssdTotalValue | MarketTotalValue / AssdTotalValue | Market premium over assessor book. >1 ⇒ hidden equity. |
Lot, building and living area are kept as three distinct concepts (they measure genuinely different things) — but within each concept one canonical column is chosen and the rest dropped. The leverage_ratio priority is CLBTV → CLTV → LTV: CLBTV (combined loan balance to value) is the most current; CLTV is at origination; LTV is the single-loan current ratio.
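The formulas in the table above can be sketched in plain Python as follows. This is an illustration, not the code in src/new_model/features.py: the function signatures are hypothetical, and the handling of missing values (a missing state treated as in-state, a missing or non-positive assessed value yielding `None`) is an assumption the source does not specify.

```python
from typing import Optional, Tuple


def absentee_level(absentee: bool, mailing_state: Optional[str],
                   situs_state: Optional[str]) -> int:
    """0 = not absentee, 1 = absentee and in-state, 2 = absentee and out-of-state."""
    if not absentee:
        return 0
    # Assumption: if either state is missing, fall back to the in-state level.
    if mailing_state is None or situs_state is None:
        return 1
    same = mailing_state.strip().upper() == situs_state.strip().upper()
    return 1 if same else 2


def property_age_years(t0: str, year_built: Optional[int]) -> Optional[int]:
    """T0_year - YearBuilt, with T0 given as 'YYYY-MM'."""
    if year_built is None:
        return None
    return int(t0[:4]) - int(year_built)


def leverage_ratio(clbtv: Optional[float], cltv: Optional[float],
                   ltv: Optional[float]) -> Tuple[Optional[float], Optional[str]]:
    """coalesce(CLBTV, CLTV, LTV) plus a leverage_source tag for provenance."""
    for source, value in (("CLBTV", clbtv), ("CLTV", cltv), ("LTV", ltv)):
        if value is not None:
            return float(value), source
    return None, None


def valuation_gap(market_total_value: Optional[float],
                  assd_total_value: Optional[float]) -> Optional[float]:
    """MarketTotalValue / AssdTotalValue; None when the ratio is undefined."""
    if market_total_value is None or assd_total_value is None:
        return None
    if assd_total_value <= 0:  # assumption: non-positive denominators are skipped
        return None
    return market_total_value / assd_total_value


print(absentee_level(True, "NY", "FL"))       # 2
print(property_age_years("2024-03", 1976))    # 48
print(leverage_ratio(None, 80.0, 75.0))       # (80.0, 'CLTV')
print(valuation_gap(300000, 200000))          # 1.5
```

Returning the source tag alongside the ratio keeps the fallback visible to anyone who later wants to test whether a CLBTV-derived value behaves differently from an LTV-derived one.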
Transaction-history & date-diff synthetics
| Feature | Raw inputs | Formula | Notes |
|---|---|---|---|
| price_appreciation | CurrentAVMValue, PrevSalesPrice | CurrentAVMValue / PrevSalesPrice | Unrealised gain. NULL handled by has_prev_sale sidecar. |
| has_prev_sale | PrevSalesPrice | 1 if not null else 0 | Missing prior sale is signal (new build / inherited / never transacted), not noise. |
| months_since_prev_sale (under audit) | T0, PrevSaleRecordingDate | (T0 − PrevSaleRecordingDate).months | Owner tenure in months. Finer than DaysOwnership. Leakage-sensitive (date-diff). |
| mortgage_age_months (under audit) | T0, Mtg1RecordingDate | (T0 − Mtg1RecordingDate).months | Age of first mortgage, a refi proxy. Leakage-sensitive (date-diff). |
| listing_duration_months (under audit) | T0, IsListedFlagDate | (T0 − IsListedFlagDate).months | How long listed for sale. Leakage-sensitive (date-diff). |
| t0_month, t0_quarter | t0 (YYYY-MM) | int(t0[5:7]); (month−1)//3 + 1 | Seasonality — spring-selling-season effect (Apr–Jun peak). |
Leakage-sensitive — under active audit. months_since_prev_sale, mortgage_age_months and listing_duration_months are the three date-diff features under an active leakage audit. Their contribution is not yet validated; do not treat it as final until the ablation runner completes.
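A minimal sketch of the date-diff and seasonality formulas, assuming T0 is a 'YYYY-MM' string anchored at the first day of its month and that month differences are counted in whole calendar months (the source does not say how partial months are handled). The helper names are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple


def parse_t0(t0: str) -> date:
    """Assumption: a 'YYYY-MM' T0 is anchored at the first day of that month."""
    return date(int(t0[:4]), int(t0[5:7]), 1)


def months_between(t0: str, event: Optional[date]) -> Optional[int]:
    """Calendar months from the event date to T0; None when the date is missing."""
    if event is None:
        return None
    anchor = parse_t0(t0)
    return (anchor.year - event.year) * 12 + (anchor.month - event.month)


def t0_month(t0: str) -> int:
    return int(t0[5:7])


def t0_quarter(t0: str) -> int:
    return (t0_month(t0) - 1) // 3 + 1


def price_appreciation(current_avm_value: Optional[float],
                       prev_sales_price: Optional[float]
                       ) -> Tuple[Optional[float], int]:
    """Return (CurrentAVMValue / PrevSalesPrice, has_prev_sale)."""
    has_prev_sale = 0 if prev_sales_price is None else 1
    if not has_prev_sale or current_avm_value is None or prev_sales_price <= 0:
        return None, has_prev_sale
    return current_avm_value / prev_sales_price, has_prev_sale


print(months_between("2024-03", date(2019, 11, 15)))  # 52
print(t0_month("2024-03"), t0_quarter("2024-03"))     # 3 1
print(price_appreciation(400000, 250000))             # (1.6, 1)
print(months_between("2024-03", date(2024, 5, 1)))    # -2
```

Under these assumptions a negative result means the event is dated after T0. That is one cheap symptom a leakage check could look for in the three audited features, though it is not a substitute for the ablation run.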
Distress trajectory synthetics
For each of the 23 distress types the builder materialises up to three signals (an active flag, time-active, and time-since-resolved), for a ceiling of 23 × 3 = 69 distress signals. The 4 distress types without a date column get only the active flag, so the realised count is 19 × 3 + 4 = 61.
| Feature | Formula | Meaning |
|---|---|---|
| <d>_active | 1 if TRY_CAST(<d> AS DOUBLE) ≥ 1.0 else 0 | Is the distress event currently active? |
| <d>_months_active | (T0 − <d>DistressDate).months when active | How long the event has been active |
| <d>_months_since_resolved | (T0 − <d>DistressDate).months when inactive | How long since the event resolved |
| n_active_distresses | sum(<d>_active) over 23 types | Total stress count (0–23) — more signals ⇒ higher sale probability |
| preforeclosure_progress_pct | (months_active × 30) / state_foreclosure_median_days | Fraction of the state's typical foreclosure timeline elapsed — cross-state comparable |
Key insight (validated 2026-04-18 across 8 counties): <d>DistressDate is bi-directional — when active=1 it is the event START; when active=0 it is the RESOLUTION. Re-emergence rate (1→0→1) is 0.31%, confirming the feature engineering is safe.
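The bi-directional reading of the distress date can be sketched as below. The function names and the dictionary layout are hypothetical; `months_from_date` stands for (T0 − <d>DistressDate) in months, computed elsewhere, and the 240-day median in the usage line is a placeholder rather than a value from the state foreclosure profile.

```python
from typing import Dict, Optional


def is_active(raw) -> int:
    """Mirror of TRY_CAST(<d> AS DOUBLE) >= 1.0: unparseable values count as inactive."""
    try:
        return 1 if float(raw) >= 1.0 else 0
    except (TypeError, ValueError):
        return 0


def distress_signals(name: str, raw_flag,
                     months_from_date: Optional[int]) -> Dict[str, Optional[int]]:
    """Build the per-type signals; the date is START when active, RESOLUTION when not."""
    active = is_active(raw_flag)
    out: Dict[str, Optional[int]] = {f"{name}_active": active}
    if months_from_date is not None:  # types without a date keep only the flag
        out[f"{name}_months_active"] = months_from_date if active else None
        out[f"{name}_months_since_resolved"] = None if active else months_from_date
    return out


def n_active_distresses(raw_flags: Dict[str, object]) -> int:
    """Count of currently active distress types."""
    return sum(is_active(v) for v in raw_flags.values())


def preforeclosure_progress_pct(months_active: Optional[int],
                                state_median_days: Optional[float]) -> Optional[float]:
    """(months_active * 30) / state_foreclosure_median_days."""
    if months_active is None or not state_median_days:
        return None
    return (months_active * 30) / state_median_days


print(distress_signals("Preforeclosure", "1", 4))
print(n_active_distresses({"Preforeclosure": "1", "Probate": "0", "Eviction": None}))  # 1
print(preforeclosure_progress_pct(4, 240))  # 0.5 (240 days is a placeholder median)
```

Writing the inactive branch to a separate column, rather than reusing one months column with a sign convention, keeps "active for 4 months" and "resolved 4 months ago" from being confused with each other downstream.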
Reference-table & external joins
| Family | Source | Examples |
|---|---|---|
| State foreclosure profile | src/new_model/ref/state_foreclosure_profile.csv (50 states + DC, 2-auditor verified vs foreclosurelaw.org) | foreclosure_type, state_foreclosure_median_days, state_redemption_days, deficiency_allowed |
| External macro (Tier E, FRED) | data/cache/macro/*.parquet, aligned by t0 | mortgage_rate_30yr (MORTGAGE30US), fed_funds_rate (FEDFUNDS), unemployment_rate_national (UNRATE), home_price_index_national (CSUSHPINSA) + 6m/12m deltas |
| External micro (Tier F, BLS + Census + FHFA) | data/cache/micro/*.parquet, aligned by (fips, t0) | unemployment_rate_county (BLS LAUS, 531 counties), median_income / homeownership_rate / median_age / median_home_value (Census ACS 5-yr), hpi + hpi_yoy_pct (FHFA state) |
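The alignment keys in the table (t0 for macro, (fips, t0) for micro) amount to two left joins. The sketch below uses plain dictionaries in place of the cached parquet files; the `attach_external` helper is hypothetical and every numeric value is a made-up placeholder, not real FRED or BLS data.

```python
from typing import Dict, Tuple


def attach_external(row: Dict[str, object],
                    macro_by_t0: Dict[str, Dict[str, float]],
                    micro_by_fips_t0: Dict[Tuple[str, str], Dict[str, float]]
                    ) -> Dict[str, object]:
    """Left-join macro features on t0 and micro features on (fips, t0)."""
    out = dict(row)  # copy so the input row is not mutated
    # A missing key adds nothing, leaving those features absent for the row.
    out.update(macro_by_t0.get(row["t0"], {}))
    out.update(micro_by_fips_t0.get((row["FIPS"], row["t0"]), {}))
    return out


# Placeholder lookup tables; the values are illustrative only.
macro = {"2024-03": {"mortgage_rate_30yr": 1.0, "fed_funds_rate": 2.0}}
micro = {("12086", "2024-03"): {"unemployment_rate_county": 3.0}}

row = {"FIPS": "12086", "t0": "2024-03", "property_age_years": 48}
print(attach_external(row, macro, micro))
```

Keying macro features on t0 alone gives every property in the same month identical values, so any variation the model picks up from Tier E is purely temporal; Tier F adds the county dimension.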
Outlier filter (applied after synthetics, before modelling)
| Rule | Drop row if… |
|---|---|
| Bedrooms | > 50 |
| Bathrooms | > 50 |
| Stories | > 100 |
| property_age_years | < 0 or > 400 |
Impossible values are almost always data corruption — the filter drops the row (does not clip the value), preserving legitimate edge cases such as large multi-family complexes with many bathrooms.
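The four rules can be written as a row-level predicate. In this sketch the dictionary keys `Bedrooms`, `Bathrooms` and `Stories` are assumed to match the rule names in the table, and a null value is assumed to pass the filter, which the source does not state.

```python
from typing import Dict, List, Optional


def is_outlier(row: Dict[str, Optional[float]]) -> bool:
    """True when any rule in the outlier table fires for this row."""
    def exceeds(key: str, limit: float) -> bool:
        value = row.get(key)
        return value is not None and value > limit  # assumption: nulls pass

    age = row.get("property_age_years")
    age_bad = age is not None and (age < 0 or age > 400)
    return (exceeds("Bedrooms", 50) or exceeds("Bathrooms", 50)
            or exceeds("Stories", 100) or age_bad)


def apply_outlier_filter(rows: List[Dict[str, Optional[float]]]
                         ) -> List[Dict[str, Optional[float]]]:
    """Drop offending rows entirely; surviving values are never clipped."""
    return [r for r in rows if not is_outlier(r)]


rows = [
    {"Bedrooms": 3, "Bathrooms": 2, "Stories": 1, "property_age_years": 48},
    {"Bedrooms": 3, "Bathrooms": 60, "Stories": 1, "property_age_years": 48},   # dropped
    {"Bedrooms": 40, "Bathrooms": 45, "Stories": 3, "property_age_years": 20},  # kept
    {"Bedrooms": 2, "Bathrooms": 1, "Stories": 1, "property_age_years": -1},    # dropped
]
print(len(apply_outlier_filter(rows)))  # 2
```

The third row shows why the thresholds are set high: a 40-bedroom multi-family complex is unusual but plausible, so it survives with its values intact.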
Where this connects
The bucketed feature tiers (A physical · B owner+distress · C valuation+activity · D date-diffs under audit · E national macro · F local market context) are defined on the methodology page. The column-level reference lives in the Gold variable inventory; how each feed is provenance-tracked is in Data sources; and how raw permit rows are classified before they become labels is in Permit classification.
Rendered from notes/variable_taxonomy.md + notes/synthetic_features.md.