Feature taxonomy

How raw silver columns become model features — the variable buckets, the INCLUDE/REVIEW/EXCLUDE triage, and the synthetic-feature families that fold many raw columns into one business-readable signal.

Every model input traces back to the silver property schema. Before a column can become a feature it gets sorted into exactly one bucket and tagged with an action: INCLUDE (used), REVIEW (candidate, not yet wired), or EXCLUDE (dropped — identifier, free-text, redundant, or label-leakage). The synthetic-feature layer then collapses several raw columns into a single derived signal so the model sees signal, not redundancy.

476 silver columns triaged in the taxonomy · 67 tagged INCLUDE · 109 tagged REVIEW (candidates) · 300 tagged EXCLUDE

Status — building. The REI roofing model is in active development. INCLUDE/REVIEW/EXCLUDE counts reflect the propose_action triage in notes/columns_reviewed.csv; REVIEW columns are candidates that may or may not be wired into a given model version. Three date-diff synthetics are under an active leakage audit (flagged below).

The variable taxonomy

The 476 silver columns are partitioned into 15 buckets. Each bucket holds one concept (identifiers, physical, owner, financial, distress, location…), and each column lands in exactly one bucket. Counts below are taken directly from the taxonomy summary table.

| Bucket | What it holds / provenance | N | INCLUDE | REVIEW | EXCLUDE |
|---|---|---|---|---|---|
| Hudi / pipeline metadata | Partition salting, dedupe keys, row timestamps, refresh date; pipeline plumbing, not signal | 4 | 0 | 1 | 3 |
| Core identifiers | APN, PropertyId, Period, FIPS (keys); only FIPS and OldApnIndicator kept | 8 | 2 | 0 | 6 |
| Current transaction (label leakage) | Buyer/seller names, sale price, recording dates of the current sale; describes the outcome, so all excluded | 27 | 0 | 0 | 27 |
| Distress signals | 23 distress types × active flag + date (preforeclosure, probate, tax-delinquent, divorce, eviction, liens…) | 42 | 39 | 0 | 3 |
| Property — physical | Year built, use, area, rooms, construction/condition/quality codes | 29 | 7 | 11 | 11 |
| Property — location & geography | Situs address parts, ZIP, census tract/block, municipality, subdivision | 24 | 4 | 6 | 14 |
| Owner identity (free-text) | Owner first/middle/last names, mailing address strings; PII free-text, all excluded | 11 | 0 | 0 | 11 |
| Owner — relationship with property | Absentee, owner-occupied, owner type, equity flag, ownership tenure | 12 | 3 | 2 | 7 |
| Financial — valuations | Assessor / market / AVM values, tax amount, FA confidence score | 13 | 1 | 6 | 6 |
| Financial — mortgage stack | Mtg1–Mtg4 + concurrent liens: balances, rates, terms, lender names, LTV/CLTV/CLBTV | 86 | 2 | 16 | 68 |
| Financial — listing / MLS activity | List price, listed flag, listing status | 5 | 0 | 0 | 5 |
| Prior transaction (PrevSale*) | Previous sale price, dates, prior-mortgage detail; prior, so usable as-of-T0 | 39 | 2 | 15 | 22 |
| Provider-scored signals (FA) | First American propensity / liquid-assets / purchase-intel scores | 3 | 1 | 2 | 0 |
| Quality / validity flags | FA *Valid recording-detail flags, cash-buyer flag, last-PFC flag | 6 | 1 | 2 | 3 |
| Miscellaneous — unclassified | Listing-agent fields, attic/basement/garage sqft, exemption flags, HOA, PFC auction detail, legal description | 167 | 5 | 48 | 114 |
| Total | | 476 | 67 | 109 | 300 |
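
Because every column lands in exactly one bucket with exactly one tag, the three tag counts in each row should add up to that row's N, and the columns should add up to the Total row. The sketch below checks this using the figures transcribed from the summary table; the names `BUCKET_COUNTS` and `check_partition` are illustrative and are not taken from the project code.

```python
# (N, INCLUDE, REVIEW, EXCLUDE) per bucket, transcribed from the summary table.
BUCKET_COUNTS = {
    "Hudi / pipeline metadata": (4, 0, 1, 3),
    "Core identifiers": (8, 2, 0, 6),
    "Current transaction": (27, 0, 0, 27),
    "Distress signals": (42, 39, 0, 3),
    "Property - physical": (29, 7, 11, 11),
    "Property - location & geography": (24, 4, 6, 14),
    "Owner identity": (11, 0, 0, 11),
    "Owner - relationship with property": (12, 3, 2, 7),
    "Financial - valuations": (13, 1, 6, 6),
    "Financial - mortgage stack": (86, 2, 16, 68),
    "Financial - listing / MLS activity": (5, 0, 0, 5),
    "Prior transaction (PrevSale*)": (39, 2, 15, 22),
    "Provider-scored signals (FA)": (3, 1, 2, 0),
    "Quality / validity flags": (6, 1, 2, 3),
    "Miscellaneous - unclassified": (167, 5, 48, 114),
}


def check_partition(counts):
    """Check each bucket's tags sum to its N, then return the column totals."""
    for bucket, (n, inc, rev, exc) in counts.items():
        if inc + rev + exc != n:
            raise ValueError(f"{bucket}: tags sum to {inc + rev + exc}, expected {n}")
    # Sum each position across buckets: (N, INCLUDE, REVIEW, EXCLUDE).
    return tuple(sum(col) for col in zip(*counts.values()))


print(check_partition(BUCKET_COUNTS))  # (476, 67, 109, 300)
```

The same check could be run against notes/columns_reviewed.csv after each triage pass, provided the file exposes the bucket and action per column.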

How the action tags work

| Tag | Meaning | Typical reason |
|---|---|---|
| INCLUDE | Used as a feature (raw or as input to a synthetic) | Carries as-of-T0 signal, low enough null rate to be useful |
| REVIEW | Candidate, not yet committed to a model version | Plausible signal but uncertain quality / coverage |
| EXCLUDE | Dropped from the model | Identifier, free-text PII, within-concept redundancy, or current-transaction label leakage |

Worked examples of the triage

| Column | Bucket | Action | Null rate | Why |
|---|---|---|---|---|
| FIPS | Core identifiers | INCLUDE | 0.00 | Five-digit county code; geography anchor |
| CurrentSalesPrice | Current transaction | EXCLUDE | 0.20 | Describes the outcome being predicted; label leakage |
| PreforeclosureDistress | Distress signals | INCLUDE | 1.00 | Lender filed NOD; sparse but high-signal |
| Owner1FirstName | Owner identity | EXCLUDE | 0.32 | Free-text PII, no model value |
| PropensityScore | Provider-scored (FA) | INCLUDE | 0.55 | FA's own propensity-to-sell; benchmark / feature |
| PrevSalesPrice | Prior transaction | INCLUDE | 0.46 | Prior sale; fully as-of-T0, feeds appreciation |

Null and distinct rates above come from the per-column tables in the source. See Gold variable inventory for the full column-level reference and Data sources for provenance of each feed.

Synthetic features

A synthetic feature replaces one or more raw columns with a single business-readable signal: redundancy removed, signal preserved, each mapping documented as raw columns (inputs) → synthetic feature (output). Sixteen families are documented. Across them, roughly 55 raw columns are folded into about 25 synthetic signals. The catalog is authoritative; the code in src/new_model/features.py reflects it.

Owner, property & valuation synthetics

| Feature | Raw inputs | Formula | Meaning |
|---|---|---|---|
| absentee_level | Absentee, MailingState, SitusState | 0 if not absentee; 1 if absentee & in-state; 2 if absentee & out-of-state | 3-way absenteeism. Out-of-state (level 2) = highest sale propensity. Miami-Dade 2024-03 sample: 6.7% are level 2. |
| property_age_years | YearBuilt, T0 | T0_year − YearBuilt | Age, not static year. Miami-Dade 2024-03 sample: median 48 yrs, range 2–124. |
| leverage_ratio | CLBTV, CLTV, LTV | coalesce(CLBTV, CLTV, LTV) + leverage_source tag | One consolidated loan-to-value (0–100). Collapses 3 collinear columns; source tag preserves provenance. |
| lot_size_sqft | LotSizeSqFt (canonical) | passthrough; drops acres/depth/frontage/listing/FA-est variants | Parcel land size, one unit. |
| building_area_sqft | BuildingArea, SumBuildingSqFt | coalesce(BuildingArea, SumBuildingSqFt); drops 5 other sum-area variants | Total building footprint. |
| living_area_sqft | SumLivingAreaSqFt (canonical) | passthrough; drops listing-derived variants | Interior livable area (excl. garage/basement/unfinished). |
| valuation_gap | MarketTotalValue, AssdTotalValue | MarketTotalValue / AssdTotalValue | Market premium over assessor book. >1 ⇒ hidden equity. |
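
As an illustration of the formulas in the table, here is a minimal sketch of three of the synthetics. It is not the code in src/new_model/features.py. It assumes Absentee arrives as a boolean, that t0 is a YYYY-MM string, and that a missing state on either side falls back to level 1; the source specifies none of these details.

```python
from typing import Optional


def absentee_level(absentee: bool,
                   mailing_state: Optional[str],
                   situs_state: Optional[str]) -> int:
    """0 = not absentee, 1 = absentee in-state, 2 = absentee out-of-state."""
    if not absentee:
        return 0
    # Assumption: if either state is missing we cannot prove out-of-state,
    # so we fall back to level 1.
    if mailing_state is None or situs_state is None:
        return 1
    same = mailing_state.strip().upper() == situs_state.strip().upper()
    return 1 if same else 2


def property_age_years(t0: str, year_built: Optional[int]) -> Optional[int]:
    """T0_year - YearBuilt, with t0 given as 'YYYY-MM'."""
    if year_built is None:
        return None
    return int(t0[:4]) - year_built


def valuation_gap(market_total_value: Optional[float],
                  assd_total_value: Optional[float]) -> Optional[float]:
    """MarketTotalValue / AssdTotalValue; None when the ratio is undefined."""
    if market_total_value is None or not assd_total_value:
        return None  # missing input or zero assessed value
    return market_total_value / assd_total_value


print(absentee_level(True, "NY", "FL"))     # 2
print(property_age_years("2024-03", 1976))  # 48
print(valuation_gap(300_000, 200_000))      # 1.5
```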

Lot, building and living area are kept as three distinct concepts (they measure genuinely different things) — but within each concept one canonical column is chosen and the rest dropped. The leverage_ratio priority is CLBTV → CLTV → LTV: CLBTV (combined loan balance to value) is the most current; CLTV is at origination; LTV is the single-loan current ratio.
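
The CLBTV → CLTV → LTV priority can be sketched as a coalesce that also records which column supplied the value. The function name and the tuple return shape are assumptions made for illustration, not the project's actual interface.

```python
from typing import Optional, Tuple


def leverage_ratio(clbtv: Optional[float],
                   cltv: Optional[float],
                   ltv: Optional[float]) -> Tuple[Optional[float], Optional[str]]:
    """Return (leverage_ratio, leverage_source) using the first non-null input."""
    for source, value in (("CLBTV", clbtv), ("CLTV", cltv), ("LTV", ltv)):
        # Test against None explicitly: a ratio of 0 is a valid value, not a gap.
        if value is not None:
            return float(value), source
    return None, None


print(leverage_ratio(None, 62.5, 70.0))  # (62.5, 'CLTV')
```

Keeping the source tag next to the value lets a later analysis check whether rows filled from CLTV or LTV behave differently from rows filled from CLBTV.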

Transaction-history & date-diff synthetics

| Feature | Raw inputs | Formula | Notes |
|---|---|---|---|
| price_appreciation | CurrentAVMValue, PrevSalesPrice | CurrentAVMValue / PrevSalesPrice | Unrealised gain. NULL handled by has_prev_sale sidecar. |
| has_prev_sale | PrevSalesPrice | 1 if not null else 0 | Missing prior sale is signal (new build / inherited / never transacted), not noise. |
| months_since_prev_sale (under audit) | T0, PrevSaleRecordingDate | (T0 − PrevSaleRecordingDate).months | Owner tenure in months. Finer than DaysOwnership. Leakage-sensitive (date-diff). |
| mortgage_age_months (under audit) | Mtg1RecordingDate | (T0 − Mtg1RecordingDate).months | Age of first mortgage; refi proxy. Leakage-sensitive (date-diff). |
| listing_duration_months (under audit) | IsListedFlagDate | (T0 − IsListedFlagDate).months | How long listed for sale. Leakage-sensitive (date-diff). |
| t0_month, t0_quarter | t0 (YYYY-MM) | int(t0[5:7]); (month−1)//3 + 1 | Seasonality; spring-selling-season effect (Apr–Jun peak). |
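
A minimal sketch of the non-date-diff rows of this table follows. The function names mirror the feature names but are illustrative only. Treating a zero PrevSalesPrice as "no ratio" is an assumption added here to avoid division by zero.

```python
from typing import Optional, Tuple


def has_prev_sale(prev_sales_price: Optional[float]) -> int:
    """Sidecar flag: 1 if a prior sale price is present, else 0."""
    return 0 if prev_sales_price is None else 1


def price_appreciation(current_avm_value: Optional[float],
                       prev_sales_price: Optional[float]) -> Optional[float]:
    """CurrentAVMValue / PrevSalesPrice; None when the ratio is undefined."""
    if current_avm_value is None or not prev_sales_price:
        return None  # missing input, or a zero price (assumption, see lead-in)
    return current_avm_value / prev_sales_price


def t0_month_quarter(t0: str) -> Tuple[int, int]:
    """Split a 'YYYY-MM' string into (month, quarter)."""
    month = int(t0[5:7])
    return month, (month - 1) // 3 + 1


print(t0_month_quarter("2024-03"))           # (3, 1)
print(price_appreciation(400_000, 200_000))  # 2.0
```

The sidecar is what lets a model tell "no prior sale" apart from "prior sale with unknown appreciation", since both leave price_appreciation null.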

Leakage-sensitive — under active audit. months_since_prev_sale, mortgage_age_months and listing_duration_months are the three date-diff features under an active leakage audit. Their contribution is not yet validated; do not treat it as final until the ablation runner completes.
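
The source writes the date-diff formula as (T0 − date).months without defining how partial months are counted. The sketch below assumes whole elapsed calendar months, and it returns None for any date after T0, since a date-diff feature should not see events from the future. Both choices are assumptions for illustration, not the audited implementation.

```python
from datetime import date
from typing import Optional


def months_between(t0: date, event: Optional[date]) -> Optional[int]:
    """Whole months elapsed from `event` to `t0`; None if missing or after t0."""
    if event is None or event > t0:
        return None  # a post-T0 date would leak future information
    months = (t0.year - event.year) * 12 + (t0.month - event.month)
    if t0.day < event.day:
        months -= 1  # the final month has not fully elapsed yet
    return months


t0 = date(2024, 3, 1)
print(months_between(t0, date(2020, 1, 15)))  # 49
print(months_between(t0, date(2024, 4, 1)))   # None
```

The same helper would serve months_since_prev_sale, mortgage_age_months and listing_duration_months, with only the input date column changing.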

Distress trajectory synthetics

For each of the 23 distress types the builder materialises up to three signals: an active flag, time-active, and time-since-resolved. That gives at most 23 × 3 = 69 distress signals. The 4 distress types without a date column get only the active flag, so the realised count is 19 × 3 + 4 = 61.

| Feature | Formula | Meaning |
|---|---|---|
| <d>_active | 1 if TRY_CAST(<d> AS DOUBLE) ≥ 1.0 else 0 | Is the distress event currently active? |
| <d>_months_active | (T0 − <d>DistressDate).months when active | How long the event has been active |
| <d>_months_since_resolved | (T0 − <d>DistressDate).months when inactive | How long since the event resolved |
| n_active_distresses | sum(<d>_active) over 23 types | Total stress count (0–23); more signals ⇒ higher sale probability |
| preforeclosure_progress_pct | (months_active × 30) / state_foreclosure_median_days | Fraction of the state's typical foreclosure timeline elapsed; cross-state comparable |
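
The two aggregate rows of the table can be sketched as follows. The helper names and the dictionary input are assumptions. The 360-day median in the example is a made-up placeholder, not a value from state_foreclosure_profile.csv. The formula as written yields a fraction, with no multiplication by 100 despite the _pct suffix.

```python
from typing import Mapping, Optional


def n_active_distresses(active_flags: Mapping[str, int]) -> int:
    """Count distress types whose <d>_active flag equals 1."""
    return sum(1 for flag in active_flags.values() if flag == 1)


def preforeclosure_progress_pct(months_active: Optional[int],
                                state_foreclosure_median_days: Optional[float]
                                ) -> Optional[float]:
    """(months_active * 30) / state median days, as a fraction of the timeline."""
    if months_active is None or not state_foreclosure_median_days:
        return None  # inactive event or no state profile available
    return (months_active * 30) / state_foreclosure_median_days


flags = {"preforeclosure_active": 1, "probate_active": 0, "eviction_active": 1}
print(n_active_distresses(flags))           # 2
print(preforeclosure_progress_pct(6, 360))  # 0.5
```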

Key insight (validated 2026-04-18 across 8 counties): <d>DistressDate is bi-directional — when active=1 it is the event START; when active=0 it is the RESOLUTION. Re-emergence rate (1→0→1) is 0.31%, confirming the feature engineering is safe.
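
A sketch of how the bi-directional date could drive the three per-type signals, written in plain Python rather than the SQL the builder presumably uses. The float conversion mimics TRY_CAST by treating unparseable values as inactive. The month difference here ignores the day of month, which is a simplification assumed for brevity.

```python
from datetime import date
from typing import Dict, Optional


def _to_active(raw) -> int:
    """Mimic TRY_CAST(<d> AS DOUBLE) >= 1.0; unparseable values count as 0."""
    try:
        return 1 if float(raw) >= 1.0 else 0
    except (TypeError, ValueError):
        return 0


def distress_signals(name: str, raw_flag, distress_date: Optional[date],
                     t0: date) -> Dict[str, Optional[int]]:
    """Build <d>_active, <d>_months_active and <d>_months_since_resolved."""
    active = _to_active(raw_flag)
    elapsed = None
    if distress_date is not None and distress_date <= t0:
        # Calendar-month difference; the day of month is ignored.
        elapsed = (t0.year - distress_date.year) * 12 + (t0.month - distress_date.month)
    return {
        f"{name}_active": active,
        # The date is the event START while active ...
        f"{name}_months_active": elapsed if active else None,
        # ... and the RESOLUTION date once inactive.
        f"{name}_months_since_resolved": None if active else elapsed,
    }


print(distress_signals("preforeclosure", "1", date(2023, 9, 10), date(2024, 3, 1)))
# {'preforeclosure_active': 1, 'preforeclosure_months_active': 6,
#  'preforeclosure_months_since_resolved': None}
```

Only one of the two month counters is ever populated for a given row, which is why the single date column can safely feed both.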

Reference-table & external joins

| Family | Source | Examples |
|---|---|---|
| State foreclosure profile | src/new_model/ref/state_foreclosure_profile.csv (50 states + DC, 2-auditor verified vs foreclosurelaw.org) | foreclosure_type, state_foreclosure_median_days, state_redemption_days, deficiency_allowed |
| External macro (Tier E, FRED) | data/cache/macro/*.parquet, aligned by t0 | mortgage_rate_30yr (MORTGAGE30US), fed_funds_rate (FEDFUNDS), unemployment_rate_national (UNRATE), home_price_index_national (CSUSHPINSA) + 6m/12m deltas |
| External micro (Tier F, BLS + Census + FHFA) | data/cache/micro/*.parquet, aligned by (fips, t0) | unemployment_rate_county (BLS LAUS, 531 counties), median_income / homeownership_rate / median_age / median_home_value (Census ACS 5-yr), hpi + hpi_yoy_pct (FHFA state) |
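
The two alignment keys, t0 for macro series and (fips, t0) for micro series, can be illustrated with a plain dictionary lookup. This is a sketch only: the real pipeline reads parquet caches, the function name is hypothetical, and the numeric values below are made-up placeholders rather than real FRED or BLS figures.

```python
from typing import Any, Dict, Mapping, Tuple


def attach_context(row: Mapping[str, Any],
                   macro_by_t0: Mapping[str, Mapping[str, float]],
                   micro_by_fips_t0: Mapping[Tuple[str, str], Mapping[str, float]]
                   ) -> Dict[str, Any]:
    """Return a copy of `row` with macro and micro columns attached."""
    out = dict(row)
    # Macro series are national, so t0 alone identifies the record.
    out.update(macro_by_t0.get(row["t0"], {}))
    # Micro series vary by county, so the key is (fips, t0).
    out.update(micro_by_fips_t0.get((row["fips"], row["t0"]), {}))
    return out


# Placeholder values for illustration only.
macro = {"2024-03": {"mortgage_rate_30yr": 6.5}}
micro = {("12086", "2024-03"): {"unemployment_rate_county": 2.0}}

row = {"fips": "12086", "t0": "2024-03", "property_age_years": 48}
print(attach_context(row, macro, micro))
```

A row with no matching key simply gains no extra columns in this sketch; a production join would more likely emit explicit nulls so that every row has the same schema.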

Outlier filter (applied after synthetics, before modelling)

| Rule | Drop row if… |
|---|---|
| Bedrooms | > 50 |
| Bathrooms | > 50 |
| Stories | > 100 |
| property_age_years | < 0 or > 400 |

Impossible values are almost always data corruption, so the filter drops the whole row rather than clipping the value. The thresholds sit far above typical values, which preserves legitimate edge cases such as large multi-family complexes with many bathrooms.
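
The four rules can be sketched as a row-level predicate. The dictionary-per-row representation and the decision to let null values pass the filter are assumptions; the source states only the thresholds.

```python
from typing import Any, Iterable, List, Mapping


def is_outlier(row: Mapping[str, Any]) -> bool:
    """True if any rule from the outlier table fires for this row."""
    def over(key: str, limit: float) -> bool:
        value = row.get(key)
        # Assumption: a missing value is not an outlier.
        return value is not None and value > limit

    age = row.get("property_age_years")
    bad_age = age is not None and (age < 0 or age > 400)
    return over("Bedrooms", 50) or over("Bathrooms", 50) or over("Stories", 100) or bad_age


def drop_outliers(rows: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Drop offending rows entirely; values are never clipped."""
    return [row for row in rows if not is_outlier(row)]


rows = [
    {"Bedrooms": 3, "Bathrooms": 2, "Stories": 1, "property_age_years": 48},
    {"Bedrooms": 60, "Bathrooms": 2, "Stories": 1, "property_age_years": 10},
    {"Bedrooms": 4, "Bathrooms": 40, "Stories": None, "property_age_years": 30},
    {"Bedrooms": 2, "Bathrooms": 1, "Stories": 1, "property_age_years": -1},
]
print(len(drop_outliers(rows)))  # 2
```

The third row, with 40 bathrooms, survives: it is unusual but below the threshold, which is the kind of edge case the filter is meant to keep.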

Where this connects

The bucketed feature tiers (A physical · B owner+distress · C valuation+activity · D date-diffs under audit · E national macro · F local market context) are defined on the methodology page. The column-level reference lives in the Gold variable inventory; how each feed is provenance-tracked is in Data sources; and how raw permit rows are classified before they become labels is in Permit classification.

Rendered from notes/variable_taxonomy.md + notes/synthetic_features.md.