
Demographics catalog

The free-government demographic cache at data/demographics/ is pulled once and reused across every model and every model version. This page is the per-source reference.

Both models read property records (First American parcels, BuildZoom permits) and enrich them with free, nationwide, public-government context — income, vacancy, broadband, affordability, hazard, climate. That context lives in a curated parquet cache at data/demographics/, joined to property records on standard zero-padded FIPS string keys. It is not specific to one build; deleting it costs hours of bandwidth and dev time to re-pull (no auto-refresh). The plain-language permit↔property match is on Data sources; this page is the source-by-source catalog underneath it.

52 source directories (Wave-10 completeness audit)
134 main parquets per MANIFEST.json
29.1M total rows (auto-inventory 2026-05-27)
1.73 GB total parquet disk

Counts depend on what is being measured: the README "Complete Inventory" reports 44 source families / 134 main parquets / 29,136,543 rows / 1.73 GB, while the Wave-10 directory audit counts 52 source directories (196 parquets including raw and sub-tables). Both figures are reported as-is; they differ because one counts families and the other counts directories. Vintages across the cache range from 2010 to 2025.

The eight load-bearing tiers

These eight families carry the primary socioeconomic, count, leverage, affordability, employment and infrastructure signal. All grains are joined on zero-padded FIPS strings (see Join keys below). Row counts and Maricopa 04013 anchor values are from the per-wave QA audits (2026-05-27/28).

| Source | Grain | Join key | Vintage · safe T0 floor | What it provides |
|---|---|---|---|---|
| ACS 5-yr (Census) | Block-Group · Tract · ZCTA · County | geoid (12/11/5/5) | 2019–2023 (live) + 2018–2022 (backfill); center ≈2021/2020 · T0 ≥ 2025-Q1 / 2024-Q1 | 139 curated variables across 30 tables: median HH income, home value, gross rent, tenure, year built, education, mobility, commute. BG = finest socioeconomic context. Maricopa MHI $85,518 (2023) / $80,675 (2022). |
| Decennial 2020 (Census) | Block · Block-Group · County | geoid_bg · fips5 (15/12/5) | April-1 2020 PIT · T0 ≥ 2021-Q3 | 100%-count population + housing-unit + vacancy. Freshest exact vacancy. Block file 8.13M rows; Maricopa county pop 4,420,568, hu_vacant 169,248. |
| IRS SOI (IRS) | USPS ZIP (not ZCTA) | zip5 | 2018–2022 (5 vintages) · TY2022 T0 ≥ 2025-Q1 | Income distribution by ZIP: 5 AGI brackets per ZIP (multi-row). Mortgage-interest density / leverage proxy. Column is zip5, not zipcode. |
| HUD CHAS / FMR / IL (HUD) | Tract · County · ZIP | county_fips · geoid · zip5 | CHAS 2018–2022 (T0 ≥ 2025-Q1) · FMR/IL FY25 (forward-looking, already safe) | Cost-burden / affordability bands (CHAS, 23 sub-tables), Fair Market Rents (FMR), AMI Income Limits (IL), plus Small-Area FMR by ZIP. Maricopa 2BR FMR $1,950. |
| BLS QCEW (BLS) | County × NAICS · State | county_fips · state_fips | 2024 annual · T0 ≥ 2025-Q3 | Employer-side wage / employment cycle by industry. 264,923 county rows (audited). |
| FCC BDC (FCC) | County · Place · CBSA · CongDist · State · Tribal · National | county_fips | Dec 2024 collection · T0 ≥ 2025-Q1 | Broadband availability by technology / speed (infrastructure premium). County file 366,510 rows. Maricopa 100% any-tech, 92% gigabit. Plus a mobile (BDC-mobile) family. |

HMDA (CFPB mortgage applications, 2022+2023, county) appears in the README inventory alongside the eight tiers as a mortgage-origination signal — 152,065 (2022) + 141,940 (2023) county rows; county_fips 5-digit string, Maricopa present both years.

Additional families present (documented)

The cache also holds many more free-government families beyond the eight above. Listed here are only those with a spec/audit in notes/demographics/. Row counts from the README auto-inventory and per-wave audits; some sources carry documented gaps (flagged).

| Family | Grain | Join key | Vintage | What it provides · notes |
|---|---|---|---|---|
| BEA regional income | County | county_fips | 2019–2023 | Per-capita / personal income by county. 2019 has 3,113 rows vs 3,114 later (1-county delta, cosmetic). |
| CDC SVI · PLACES · WONDER | County · Tract | county_fips · geoid | 2022 (SVI), 2024 (PLACES), 2020–2023 (WONDER) | Social Vulnerability Index, local health prevalence, mortality. PLACES Maricopa 993 tracts vs SVI 1,009 (PLACES suppresses low-pop). |
| EPA EJSCREEN · Walkability · AQS | Block-Group · County | geoid · geoid20 · county_fips | 2024 (EJ/Walk), 2020–2024 (AQS) | Environmental-justice indicators, walkability index, air quality. AQS sparse (~1,030 monitored counties/yr; absent = no monitor, not clean air). |
| FEMA NRI · NFHL | County · Tract | county_fips · tract_geoid · fips5 | 2024 (NRI) | National Risk Index (natural-hazard risk) + flood-hazard layer. Use county_fips/tract_geoid, not legacy STCOFIPS. |
| NOAA storms · normals | County · Event | county_fips | 2015–2024 (storms), 1991–2020 (normals) | Storm-event history (943,100 events audited) + 30-yr climate normals. 61% of storm events were zone-based, crosswalked to counties; 154 normals counties via nearest-neighbor (has_station=False). |
| USDA RUCC · RUCA · SNAP · Food Atlas · FAR | County · Tract · ZIP | fips5 · tract_geoid · zip5 | 2010 (RUCA), 2020 (FAR), 2023 (rest) | Rural-urban continuum, commuting areas, SNAP participation, food-access atlas, frontier codes. |
| FHFA HPI | County · CBSA · State · ZIP3 | fips_code · cbsa · state_fips · zip3 | time-series | House Price Index. County annual 106,252 rows, 2,795 unique FIPS, Maricopa present. |
| Housing market: Zillow · Redfin · NAR EHS | County · ZIP · MSA · State | geoid · fips_code · cbsa · state_fips | time-series | ZHVI / ZORI / sale-price / velocity / inventory (Zillow 8,677,707 rows, 7 files), weekly market tracker (Redfin), existing-home-sales (NAR). See open issues below. |

Other documented dirs the audits also cover: Census secondary (CBP, ZBP, PEP, HVS, QWI, AHS, LODES), IRS Migration + County, IRS/ACS migration flows, SAIPE, US Drought Monitor, Tigerline (geocoding shapes), HUD ZIP crosswalk, BLM PAD-US, WRLURI, NHTSA FARS, FDIC SOD, NCES CCD, USGS NLCD, MS USBF, USPS vacancy, SBA loans, energy/minerals, DOE LEAD, FBI UCR, FRED macro.

Join keys

All keys are zero-padded strings — never integers (CLAUDE.md rule #1; 04013 not 4013). Across the 44 families the geo-key column travels under 10 different names; the canonical ones:

| Key | Format | Example | Used by |
|---|---|---|---|
| state_fips | 2-digit | 04 | all sources |
| county_fips / fips5 | 5-digit | 04013 | ACS · Decennial · HUD · BLS · FCC · HMDA · BEA · CDC · EPA · FEMA · property records |
| tract_geoid | 11-digit | 04013010101 | ACS · HUD CHAS · CDC · FEMA NRI tract |
| bg_geoid / geoid | 12-digit | 040130101011 | ACS · Decennial · FCC · EPA EJSCREEN/Walkability |
| block_geoid | 15-digit | 040130101011001 | Decennial · FCC |
| zcta | 5-digit | 85003 | ACS only |
| zip5 / zip_usps | 5-digit | 85003 | IRS SOI · HUD SAFMR · Census ZBP · USDA |
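
The zero-padding rule can be sketched as a small cast-on-read helper. This is an illustration under stated assumptions, not the project's loader: `pad_key` and `KEY_WIDTHS` are hypothetical names, and only the digit widths come from the table above.

```python
import pandas as pd

# Digit widths from the join-key table; the dict and helper name are illustrative.
KEY_WIDTHS = {
    "state_fips": 2,
    "county_fips": 5,
    "tract_geoid": 11,
    "bg_geoid": 12,
    "block_geoid": 15,
    "zip5": 5,
}


def pad_key(values: pd.Series, width: int) -> pd.Series:
    """Return keys as zero-padded strings; missing values stay missing."""
    # Going through nullable Int64 turns 4013, 4013.0 and "04013" into 4013,
    # so the string form never carries a trailing ".0".
    as_int = pd.to_numeric(values, errors="coerce").astype("Int64")
    return as_int.astype("string").str.zfill(width)


# Mixed inputs of the kind an integer-typed or float-typed column produces.
raw = pd.Series([4013, 4013.0, "04013", None])
county_fips = pad_key(raw, KEY_WIDTHS["county_fips"])
```

Applying the helper to both sides of a join before merging avoids silent misses between `4013` and `04013`.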

Vintage-safety + join gotchas (no future leak)

Features must be computable from the T0 snapshot alone (CLAUDE.md rule #2). Each source publishes with a lag, so each vintage has a hard T0 floor; using it for any earlier T0 is future leakage. For a T0 below the floor, fall back to the prior vintage or emit NaN. Safe-T0 floors (from README):

| Source | Snapshot | Pub lag | Safe T0 ≥ |
|---|---|---|---|
| ACS 5-yr 2019–2023 | rolling avg ~2021 | ~14mo | 2025-Q1 |
| Decennial 2020 | April-1 2020 PIT | ~18mo | 2021-Q3 |
| IRS SOI 2022 | tax year 2022 | ~24mo | 2025-Q1 |
| HUD CHAS 2018–2022 | ACS derivative | ~36mo | 2025-Q1 |
| BLS QCEW 2024 | calendar 2024 | ~6mo | 2025-Q3 |
| FCC BDC Dec 2024 | Dec 2024 collection | ~4mo | 2025-Q1 |
| HMDA 2023 | calendar 2023 | ~9mo | 2024-Q4 |
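
The fallback rule can be sketched as a lookup: pick the newest vintage whose floor is at or before T0, otherwise return nothing so the caller emits NaN. The floors below are copied from the tables on this page (the ACS 2018–2022 backfill floor of 2024-Q1 comes from the tier table); the dict layout, source keys and function names are hypothetical. Floors for older vintages that this page does not list are deliberately left out.

```python
from typing import Optional

# (vintage, safe T0 floor) pairs, newest vintage first. Illustrative layout only.
SAFE_T0_FLOORS = {
    "acs5": [("2019-2023", "2025-Q1"), ("2018-2022", "2024-Q1")],
    "decennial": [("2020", "2021-Q3")],
    "irs_soi": [("2022", "2025-Q1")],
    "hud_chas": [("2018-2022", "2025-Q1")],
    "bls_qcew": [("2024", "2025-Q3")],
    "fcc_bdc": [("2024-12", "2025-Q1")],
    "hmda": [("2023", "2024-Q4")],
}


def quarter_index(label: str) -> int:
    """Map 'YYYY-Qn' to a sortable integer count of quarters."""
    year, quarter = label.split("-Q")
    return int(year) * 4 + int(quarter) - 1


def safe_vintage(source: str, t0: str) -> Optional[str]:
    """Return the newest vintage whose floor is at or before t0, else None."""
    for vintage, floor in SAFE_T0_FLOORS[source]:
        if quarter_index(floor) <= quarter_index(t0):
            return vintage
    return None  # no safe vintage: the caller should emit NaN


# A 2024-Q3 snapshot may use the ACS backfill but not the live vintage.
acs_for_2024q3 = safe_vintage("acs5", "2024-Q3")
```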

Four gotchas to apply before any join

ZCTA ≠ USPS ZIP. ZCTA is the Census polygon approximation of a mail ZIP, not 1:1. ~3–5% of property records have a USPS ZIP with no matching ZCTA, and ~2% of ZCTAs cross county lines. Bridge via the HUD ZIP↔ZCTA crosswalk. Prefer block-group / tract joins when lat/lon is available; only fall back to ZCTA when you have just a ZIP.
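
A minimal sketch of the bridge step, assuming the crosswalk has already been reduced to one row per ZIP with columns named `zip5` and `zcta`. Those column names, `parcel_id`, the helper name and the sample ZIP `85999` are all hypothetical; the real crosswalk schema is not described on this page.

```python
import pandas as pd


def attach_zcta(props: pd.DataFrame, xwalk: pd.DataFrame) -> pd.DataFrame:
    """Left-join property records to a ZIP-to-ZCTA crosswalk and flag misses."""
    # validate="m:1" raises if the crosswalk maps one ZIP to several ZCTAs,
    # which would otherwise duplicate property rows silently.
    out = props.merge(
        xwalk[["zip5", "zcta"]],
        on="zip5",
        how="left",
        validate="m:1",
        indicator=True,
    )
    out["zcta_matched"] = out["_merge"].eq("both")
    return out.drop(columns="_merge")


# Illustrative data: the second ZIP has no crosswalk row.
props = pd.DataFrame({"parcel_id": ["a", "b"], "zip5": ["85003", "85999"]})
xwalk = pd.DataFrame({"zip5": ["85003"], "zcta": ["85003"]})
joined = attach_zcta(props, xwalk)
```

Keeping the `zcta_matched` flag makes the unmatched share measurable per build instead of letting those rows fall out of a later inner join.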

ACS sentinel -666666688. Census encodes "estimate not available" as this numeric sentinel — the parquets retain it raw (NaN-normalization only covers textual nulls). Filter it before use: df['mhi'].where(df['mhi'] > 0). Example: BG 040130101021 has b19013_001_e = -666666688.
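
A sketch of the masking step as a reusable helper. It masks any negative value rather than only non-positive ones, which is an assumption on my part: unlike the `> 0` filter above it keeps genuine zeros (useful for count columns) and presumes no curated ACS estimate is legitimately negative. The helper name is hypothetical.

```python
from typing import Sequence

import pandas as pd

ACS_SENTINEL = -666666688  # value as stored in the parquets, per this page


def mask_acs_sentinel(df: pd.DataFrame, cols: Sequence[str]) -> pd.DataFrame:
    """Return a copy of df with negative values in `cols` replaced by NaN."""
    out = df.copy()
    for col in cols:
        # Series.where keeps values where the condition holds, NaN elsewhere.
        out[col] = out[col].where(out[col] >= 0)
    return out


# Illustrative rows: a real estimate, the sentinel, and a true zero.
acs = pd.DataFrame({"b19013_001_e": [85518.0, float(ACS_SENTINEL), 0.0]})
clean = mask_acs_sentinel(acs, ["b19013_001_e"])
```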

FCC county_fips INTEGER recast. The original VALIDATION_REPORT flagged 12 FCC parquets storing county_fips as INTEGER. County-level files are now fixed to zero-padded VARCHAR; non-county grains carry the column as 100%-null (correct — those grains have no county key). The remaining FAIL is fcc_bdc_mobile provider_summary provider_id as int64 — but that is an FCC identifier, not a FIPS, so cast-on-read suffices.

Prefixed / multi-row keys. HUD CHAS geoid is summary-level-prefixed (1400000US04013010102 — strip with SUBSTRING(geoid, 10) or LIKE); HUD FMR county FIPS is the first 5 chars of a 10-digit fips; IRS SOI is multi-row per ZIP (5 AGI brackets — aggregate or join on (zip5, agi_bracket)); Tigerline tract county_fips is 3-digit (use geoid[:5]).
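
The key fixes above can be sketched in pandas as follows. The slice positions follow the text (a 9-character summary-level prefix, the first 5 characters of the 10-digit FMR fips); the function names, the `n_returns` column and the sample FMR value are hypothetical.

```python
from typing import Sequence

import pandas as pd


def chas_tract_geoid(geoid: pd.Series) -> pd.Series:
    """Drop the 9-character prefix such as '1400000US' (SUBSTRING(geoid, 10))."""
    return geoid.str[9:]


def fmr_county_fips(fips10: pd.Series) -> pd.Series:
    """County FIPS is the first 5 characters of the 10-digit HUD FMR fips."""
    return fips10.str[:5]


def soi_zip_totals(soi: pd.DataFrame, value_cols: Sequence[str]) -> pd.DataFrame:
    """Collapse the per-bracket IRS SOI rows to one row per zip5 by summing."""
    return soi.groupby("zip5", as_index=False)[list(value_cols)].sum()


tract = chas_tract_geoid(pd.Series(["1400000US04013010102"]))
county = fmr_county_fips(pd.Series(["0401399999"]))
soi = pd.DataFrame({
    "zip5": ["85003", "85003", "85004"],
    "agi_bracket": [1, 2, 1],
    "n_returns": [100, 50, 70],
})
totals = soi_zip_totals(soi, ["n_returns"])
```

Summing is only one way to aggregate; joining on (zip5, agi_bracket) instead keeps the income distribution intact.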

Known gaps & status

The cache is curated, not finished. Honest open items from the QA audits:

| Item | Status | Detail |
|---|---|---|
| Redfin ZIP weekly | FAIL: empty | 0-row parquet; ETL never wrote data. Raw TSV (1.5 GB) exists; re-ingest required. |
| Redfin county fips_code | FAIL: 99.8% null | Only 4 FIPS populated; region_name + state_fips + cbsa usable as workaround pending FIPS re-derive. |
| USGS NLCD county | FAIL: partial | 2,287 of ~3,143 counties (856 missing); re-pull required. |
| MS USBF | FAIL: stub | 16-row AZ+DC stub; national pull never completed. |
| ACS BG 34 dead columns | WARN | 34 cols 100% NULL at BG (B07001/B08006/B17001/B19083/B25097, not published at BG). No data loss; schema bloat only. |
| FBI UCR · FRED macro | WARN: empty dirs | Spec + script exist, zero parquets. Planned. |

Overall validation across the cache: 115 PASS · 6 WARN · 13 FAIL (VALIDATION_REPORT totals). The 13 FAILs are the 12 FCC INTEGER-FIPS files (since recast) plus the empty Redfin ZIP. Tier-F (local market context) wiring of these into the REI feature builder is in progress; three model features (listing_duration_months, months_since_prev_sale, mortgage_age_months) remain under leakage audit and are not yet validated.

Where this connects

Summary card and deep-link target: Data sources → US demographics cache. The property-record side (how permits match to parcels) is on the same page. The geocoding helper that resolves lat/lon → block-group geoid for these joins lives at src/new_model/geocoding.py; the full per-source specs at notes/demographics/<source>_spec.md and the cross-source manifest at data/demographics/MANIFEST.json.

Rendered from notes/demographics/ (README + DATA_CATALOG + specs + audits).