OverviewETL / quality › BuildZoom walkthrough
BuildZoom

BuildZoom ETL — the data's journey, in human language API → Bronze → Silver → Gold · audit notes · 2026-05-19 hover any dotted term for plain-English

One-line summary. A 4-table raw ingestion (Bronze, Hudi COW) joins into a single denormalized fact table (Silver), then enriches with First-American property identifiers via a 3-condition address match (Gold). Quality is gated at each layer by a QA framework with DynamoDB tracking and Slack alerts. The pipeline is technically solid but its contract with downstream consumers is incomplete — naming says "Roofing" when contents are all-permits, no permit_category enum is exposed, and the date-sanitize gate lets 2066 through. There is no Platinum tier — ML-ready data is built ad hoc by each consumer today.

1 · Architecture — the swim-lane view

Five actors, each with a distinct responsibility. (This is a swim-lane diagram, BPMN-style.) Read left to right:

BuildZoom API (vendor) S3 Landing zone EMR Spark + Hudi (Bronze) EMR Spark + Hudi (Silver) EMR Spark + Hudi (Gold) Observability (DynamoDB, Slack, CloudWatch) BuildZoom CSV exports buildzoom-landing-zone Permits Classif. Statuses Properties Roofing_Master_V1 (denormalized fact) Roofing_Master_Gold_V1 (+ FA_PROPERTYID[ ]) DynamoDB execution logs Slack alerts via Lambda Real_Estate_Master (FA, silver)

Read this diagram as a contract

Each lane owns one job. Bronze ≠ Silver ≠ Gold — when the boundaries blur (e.g. "Roofing_Master_Gold" contains non-roofing rows), downstream consumers compensate, and that compensation is exactly the problem we're trying to fix.

2 · Bronze — raw ingestion, four tables

Bronze is the literal mirror of the BuildZoom vendor extract. CSVs land in S3, an EMR Spark job reads them with an explicit schema, runs a QA gate, and writes Hudi COW (copy-on-write) tables to s3://8020rei-bronze-data-lake/. Nothing is joined here.

The four tables (one job each)

TableRecord keyPartitionWhy this partition
Roofing_Building_Permits_V1 BUILDING_PERMIT_ID PARTITION_YEAR from INITIAL_STATUS_DATE Time-pruned scans (most queries ask "permits in year X")
Roofing_Building_Permit_Classifications_V1 COMPOSITE_KEY = BUILDING_PERMIT_ID|PROJECT_TYPE none Small table (~hundreds of MB); a permit may carry multiple PROJECT_TYPE rows that later collapse into an array
Roofing_Building_Permit_Statuses_V1 ID = BUILDING_PERMIT_ID + PERMIT_STATUS + EFFECTIVE_DATE PARTITION_YEAR from EFFECTIVE_DATE Status timeline per permit (multiple statuses per permit, kept as history)
Roofing_Properties_V1 PROPERTY_ID PARTITION_FIPS = COUNTY_FIPS Geographic pruning (queries scan a few counties at a time)

Quality gate before write

Every Bronze write must pass a 6-check QA framework before the Hudi write commits. The checks are configurable per table via shared/configs.py; defaults below.

CheckThreshold (default)What it catches
Null % critical fields< 1 %Missing primary keys (BUILDING_PERMIT_ID, PERMIT_NUMBER)
Null % important fields< 10 %Missing context (PROPERTY_ID, TYPE, jurisdiction)
Duplicate % on record key< 0.6 %Pipeline producing two-of-the-same
Completeness score (weighted)≥ 75 %Overall field-fill health
Negative % positive-only fields< 0.1 %JOB_VALUE < 0, SQUARE_FEET < 0 …
Minimum record count≥ 1Empty deliveries

If QA fails: data is NOT written, the run is logged to DynamoDB with QA_FAILED, and a Slack alert fires. If QA passes: Hudi write commits, then a success record is logged. This is a real strength of the pipeline.

Date sanitization (important)

Before QA runs, all date columns are clipped:

year > 2100  → 2100-12-31
year < 1900  → 1900-01-01
otherwise    → keep as-is

This prevents Spark long-overflow but is too permissive for ML. A date in 2066 (and we have observed these in the 2026-05 vintage) passes the sanitizer and survives into Silver and Gold. See Issue #2.

Hudi configuration

3 · Silver — the fact table join

Silver is one Spark job (silver/Build_Zoom_Silver_Master.py, ~470 lines) that turns four Bronze tables into a single denormalized fact table: Roofing_Master_V1. Each row = one permit, with property and classification glued on.

The 3-join recipe

Permits + STATE extracted Properties UPPER + COUNTY_FIPS int Classifications PROJECT_TYPE→ARRAY Statuses window: latest only ⋈ on (STATE, PROPERTY_ID) join 1 ⋈ on BUILDING_PERMIT_ID join 2 (left) ⋈ on BUILDING_PERMIT_ID join 3 (left) Roofing_Master_V1 one row per permit

What Silver actually does to the data

This is where the "Roofing" name first misleads

The table is called Roofing_Master, but its filter is not a roofing filter. It's "every permit on every property" — a generic permits fact table. The name is a project-history artifact (the 8020REI vertical is roofing), not a content description.

4 · Gold — First-American address matching

Gold takes Roofing_Master_V1 (silver) and links each permit to one or more First-American property IDs. The output column FA_PROPERTYID is an array — a single permit can resolve to multiple FA properties.

The 3-condition match cascade

Silver permit row + FIPS_Salting (skew) 1. Std address + ZIP + FIPS StdDeliveryLine + StdZipCode (strongest match) 2. APN (parcel) + FIPS PARCEL_NUMBER == APN (deterministic when present) 3. Street + ZIP + FIPS STREET4JOIN (no whitespace) (fuzzy fallback — one→many) collect_set into FA_PROPERTYID array Gold permit row + FA_PROPERTYID[ ]

How the cascade resolves

The conditions are tried in order. (1) Standardized address — most accurate when present. (2) APN match — deterministic when both sides have one. (3) Street + ZIP fallback — fuzzy, can resolve one permit to many properties. A permit can match more than one condition (different FA properties for each), so the final array can contain multiple IDs. The conditions are tracked separately as FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3 internally before being combined.

FIPS salting — why the code is more complex than it looks

A handful of "big" counties (Maricopa 04013, Harris 48201, Miami-Dade 12086, etc.) hold a disproportionate share of records. A naive Spark partition-by-FIPS would put 60M Maricopa rows on one executor while others sit idle — classic data skew. The salting layer reads a config from S3 (SALTING_CONFIG_PATH) and rewrites the partition key:

The output column is Partition_FIPS_Salting. This is an operational hack done well — it absorbs skew but does not change the data semantics.

4b · The temporal contract — why we need event_date + observation_period

Property ID identifies the parcel — a permanent thing. Owner identifies who lived there, and owners change over time. To answer "who owned this property when the permit started" we need both: a property key AND a time anchor. The schema today exposes property ID but no canonical time anchor — eight date columns, no contract.

This proposal would derive two columns in Gold, used by every downstream consumer the same way. None of this exists today.

Gold row · 6 date columns APPLIED_DATE 2024-05-15 ISSUED_DATE 2024-06-02 INITIAL_STATUS_DATE 2024-05-15 COMPLETED_DATE 2024-09-10 CANCELLED_DATE NULL LATEST_STATUS_DATE 2024-09-10 Which one is "the" date? → everyone picks differently event_date = MIN(non-null dates) = MIN(2024-05-15, 2024-06-02, 2024-05-15, 2024-09-10, 2024-09-10) → 2024-05-15 observation_period = month_end(event_date) − 1 month = month_end(2024-05-15) − 1m = 2024-05-31 − 1m → 2024-04 Silver REM 2024-04 snapshot Why "− 1 month"? Owner / distress state in the same month as the permit is contaminated by it. The previous month-end snapshot reflects who the owner was BEFORE the permit started.

With this contract every model joins the same way: gold.property_id == rem.property_id AND rem.period == gold.observation_period. The owner returned is who lived there the month before the permit started — no leakage, no per-consumer divergence.

Edge case to document

If every date column is NULL the row gets event_date = NULL and the as-of join silently drops it. Surface this as a row-count metric in the QA gate so partitions with high NULL-date rates fail loud instead of disappearing.

5 · QA gate — same shape at every layer

Each layer (Bronze and Silver) uses a layer-specific QA class (BuildingPermitsQA, SilverMasterQA) that runs before the Hudi write. The Gold layer uses GoldMasterQA with the same pattern. Three outcomes:

┌────────────────────────┐ │ Read source / build │ │ dataframe │ └───────────┬────────────┘ │ ▼ ┌────────────────────────┐ │ Run QA checks │ │ (6 default + custom) │ └───────────┬────────────┘ │ ┌───────────────────┼───────────────────┐ ▼ ▼ ▼ ╔══════════╗ ╔══════════╗ ╔══════════════╗ ║ PASS ║ ║ QA_FAILED║ ║ ERROR ║ ╚════╤═════╝ ╚════╤═════╝ ╚══════╤═══════╝ │ Write Hudi │ Skip write │ Skip write │ Log SUCCESS │ Log QA_FAILED │ Log ERROR │ Slack ✓ │ Slack ✗ + reasons │ Slack ✗ + trace ▼ ▼ ▼ Production data Data quarantined Investigation

This pattern is uniform across layers. The output contract is "no QA-failed data ever reaches the next layer." That is real discipline.

6 · The data's journey — mental map

One row's story, end to end, from the moment BuildZoom hands us a CSV until a downstream consumer reads it:

┌─ DAY 1 ────────────────────────────────────────────────────┐ │ │ │ BuildZoom drops CSV in s3://buildzoom-landing-zone/ │ │ Faster_Format/<partition>/ │ │ • building_permits_<p>.csv │ │ • building_permit_classifications_<p>.csv │ │ • building_permit_statuses_<p>.csv │ │ • properties_<p>.csv │ │ │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ BRONZE STEP ──────────────────────────────────────────────┐ │ EMR Spark job per table: │ │ │ │ read CSV(schema) ─► clip dates 1900-2100 ─► │ │ add ts + PARTITION_* ─► run QA ─► write Hudi COW │ │ │ │ → 4 tables in s3://8020rei-bronze-data-lake/ │ │ → execution row in DynamoDB │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ SILVER STEP ──────────────────────────────────────────────┐ │ One job, batched by PARTITION_YEAR: │ │ │ │ load 4 bronze tables ─► │ │ Classifications: groupBy(BPID).collect_list(PROJECT_TYPE) │ │ Statuses: window.row_number() keep latest │ │ Properties: UPPER(CITY,STREET), COUNTY_FIPS as int │ │ ─► permits ⋈ properties on (STATE, PROPERTY_ID) │ │ ─► ⋈ classifications on BUILDING_PERMIT_ID │ │ ─► ⋈ statuses on BUILDING_PERMIT_ID │ │ ─► add ts, run QA, write Hudi │ │ │ │ → Roofing_Master_V1 in s3://8020rei-silver-data-lake/ │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ GOLD STEP ────────────────────────────────────────────────┐ │ One job, FIPS-salted to absorb skew: │ │ │ │ load silver permits + load Real_Estate_Master (FA) │ │ ─► load standardized addresses for both │ │ ─► condition 1: StdAddress + ZIP + FIPS │ │ ─► condition 2 (residuals): APN + FIPS │ │ ─► condition 3 (residuals): Street + ZIP + FIPS │ │ ─► collect FA_PROPERTYID's into ARRAY per permit │ │ ─► run GoldMasterQA, write Hudi │ │ │ │ → Roofing_Master_Gold_V1 in s3://8020rei-gold-data-lake/ │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ DOWNSTREAM (today) ───────────────────────────────────────┐ │ • Callzeke dashboard reads gold directly │ │ • Safeguard reads gold directly │ │ • Apollo (REI scoring) reads gold for Tier-G permit feats │ │ • Roofing model (new_model) reads gold to build labels │ │ │ │ ⚠ each consumer reapplies category / status filtering │ │ independently — divergence-by-construction │ └────────────────────────────────────────────────────────────┘

7 · Issues identified

HIGH = downstream consumers materially affected · MED = corner cases / future risk · LOW = naming / hygiene

The "Issue", "Where", and "Impact" columns describe what exists today (observation). The "Suggested fix" column is PROPOSAL — recommendations, not current behavior.

SevIssueWhereImpactSuggested fix
HIGH Every table in the pipeline is named Roofing_* but contains permits of all categories — this is a permits pipeline, not a roofing pipeline Bronze: Roofing_Building_Permits_V1, Roofing_Building_Permit_Classifications_V1, Roofing_Building_Permit_Statuses_V1, Roofing_Properties_V1
Silver: Roofing_Master_V1
Gold: Roofing_Master_Gold_V1
(Repo name buildzoom-etl and S3 bucket names *-bronze/silver/gold-data-lake are already neutral — only table names mislead.)
The naming assumes a roofing-only product, but the pipeline ingests every permit category that BuildZoom delivers. Empirical evidence: only 3.5 % of rows have TYPE='Roof' in the Pinellas 12 K sample, and only ~18 % contain "roof" anywhere in PROJECT_TYPE (finding 67 §B). Every downstream consumer (Callzeke, Safeguard, Apollo, the new model) re-applies its own roofing filter on top of data labelled as roofing-but-isn't — different filters, different counts, hours of investigation each time numbers diverge. Treat as a coordinated rename across Bronze + Silver + Gold. Suggested mapping:
TodayProposed
Roofing_Building_Permits_V1Permits_Master_V1
Roofing_Building_Permit_Classifications_V1Permit_Categories_V1
Roofing_Building_Permit_Statuses_V1Permit_Statuses_V1
Roofing_Properties_V1Permit_Properties_V1
Roofing_Master_V1 (silver)Permits_Master_Silver_V1
Roofing_Master_Gold_V1 (gold)Permits_Master_Gold_V1
Do it as a version bump (_V1_V2) with the new name rather than an in-place rename — keeps the existing consumers reading _V1 until they migrate. Adds breathing room for the cross-repo coordination (Callzeke dashboard, Apollo, new model, Glue catalog, DynamoDB log tables, Slack templates, CloudFormation specs all touch the names).
HIGH No permit_category enum exposed — separate but related to the rename across Bronze, Silver, Gold Renaming the tables to Permits_* only fixes the label. Until a coarse permit_category enum (ROOFING / HVAC / SOLAR / ELECTRICAL / PLUMBING / NEW_CONSTRUCTION / OTHER / UNKNOWN) is materialized as a column, every consumer still re-classifies from TYPE + SUBTYPE + DESCRIPTION + PROJECT_TYPE with regex, and findings 57 / 62 / 72 each produced different counts on the same data. Pair the rename with a deterministic enum column derived from the existing PROJECT_TYPE array in Silver. Fine sub-classification (e.g., REPLACEMENT vs REPAIR within ROOFING) stays downstream — that's a model decision.
HIGH Date sanitize lets 1900-2100 through, including 2066 bronze/Build_Zoom_Building_Permits.py:82-88 Future-corruption rows (we observed LATEST_STATUS_DATE = 2066-08-23 in the 2026-05-12 vintage) survive Bronze, Silver, Gold. Every label and feature pipeline downstream must re-cap. Clip to a narrower range like [1995-01-01, today + 90 days] in Bronze. Anything outside is a data-quality signal, not a value to preserve.
HIGH No canonical event_date and no observation_period — owner-as-of joins are impossible without them Across Bronze, Silver, Gold — the schema exposes eight date columns (INITIAL_STATUS_DATE, LATEST_STATUS_DATE, APPLIED_DATE, ISSUED_DATE, COMPLETED_DATE, CANCELLED_DATE, AUX_EFFECTIVE_STATUS_DATE, DateSnapshot) and zero canonical "this is the date that defines the permit event". Property ID alone identifies the parcel. To answer "who owned this property when the permit started" you also need a time anchor — and there is no single canonical date in the schema. Every consumer picks a different column: Apollo uses LATEST_STATUS_DATE; the Callzeke audit used arg_max(CITY, permit_date) across the gold permit date; finding 67 ranged over multiple. The owner state in Silver REM changes monthly, so picking the wrong date silently attributes the permit to the wrong owner. Materialize two columns in Gold (or Platinum) as a contract:
  • event_date = earliest non-null date across APPLIED_DATE, ISSUED_DATE, INITIAL_STATUS_DATE, COMPLETED_DATE, CANCELLED_DATE, LATEST_STATUS_DATE. Rationale: "when the permit started being worked on" — the moment the owner-at-the-time made the decision. NULL only when every date column is NULL.
  • observation_period = month_end(event_date) - 1 month (format YYYY-MM). This is the Silver REM snapshot key used for the as-of owner / distress / valuation join. One month before the event so the join uses the state that existed before the permit, not contaminated by it.
Both columns live in Gold so every downstream consumer joins to the same time anchor.
MED Silver collapses status history silver/Build_Zoom_Silver_Master.py · prepare_statuses Only the latest status is exposed (AUX_PERMIT_STATUS). Any analysis that needs the timeline (e.g. how long from APPLIED to ISSUED to COMPLETED, intermediate cancellations) must go back to Bronze. Keep both: AUX_* for latest + a side artifact permit_status_timeline at the per-permit grain (array of {status, date}).
MED No status normalization everywhere LATEST_STATUS mixes case and synonyms (complete, final, issued, canceled, void, expired, withdrawn …). Each consumer redefines what counts as "live" vs "dead." Add a status_normalized column in Silver with a finite enum (ACTIVE / COMPLETE / CANCELED / EXPIRED / PENDING / UNKNOWN) plus the raw value preserved.
MED FA_PROPERTYID as ARRAY surprises downstream gold/Build_Zoom_Gold_Master.py One permit → N FA properties (street-name matches especially) creates implicit fan-out at consumers that UNNEST without knowing. We measured 50 K-row samples where this matters. Document and surface this contract explicitly in a Gold README: "FA_PROPERTYID is a deduped array; consumers must decide whether to UNNEST (fan-out) or treat as a set."
MED JOB_VALUE outliers pass through Bronze QA only checks negative values $135 M permit values pass Bronze ("not negative") and contaminate downstream summaries. Add an upper-tail flag (e.g. job_value_outlier_flag at > p99.9) as a sidecar column — do not cap the value silently.
LOW Hard-coded date defaults from June 2025 shared/configs.py · DEFAULT_PERIOD = '2025-06-29' If read_last_period_executed ever falls back to default (e.g. corrupted DynamoDB row), the pipeline silently restarts from a year ago. Either remove the default and fail loud, or move it to a deploy-time env var.
LOW Silver/Gold READMEs missing repo root Bronze has an excellent README. Silver and Gold do not, so the contract lives only in code. Mirror the Bronze README structure for Silver and Gold (schemas, partitions, joins, QA thresholds).
LOW No Platinum tier — every model rebuilds features locally repo has only bronze/, silver/, gold/ ML consumers materialize feature snapshots and labels in their own caches. Reproducibility is per-consumer. See section 9.

8 · Opportunities ALL PROPOSAL

Every card in this section describes work that does not exist today. These are recommended improvements that the current architecture would absorb cleanly — not observations of current behavior.

Beyond fixes, here are higher-leverage improvements:

Silver  Add permit_category enum

One deterministic mapping from PROJECT_TYPE array → coarse class. Lives in shared/utils.py, tested with a finite truth table. Unlocks multi-vertical without touching the rest of the pipeline.

Silver  Status timeline as array

Keep AUX_* for latest. Add permit_status_timeline = array of {status, date, days_in_status}. No new joins; one window function more.

Gold  Per-condition match provenance

Surface which of the 3 conditions matched (already computed internally — just keep the column). Lets downstream weight by match confidence (cond-1 strongest, cond-3 fuzzy).

Gold  Generalize the table name

Rename Roofing_Master_Gold to Permits_Master_Gold. The contents are already generic; the name is the artifact.

Bronze  Narrow date clip

Clip [1995-01-01, today + 90 days] and reject outliers loudly. Pipeline already has the QA hook; this is a 6-line change.

Bronze  Outlier flags, not caps

JOB_VALUE, SQUARE_FEET, FEES: add side-flags at p99.9 cuts without altering the source value.

Platinum  Vertical-aware feature snapshots

One ETL produces (T0, FIPS) feature snapshots that serve every vertical model (roofing, HVAC, solar). Labels per vertical live in vertical-scoped sub-paths. See section 9.

Gold  Canonical event_date + observation_period

event_date = earliest non-null of the 6 status dates ("when the permit started being worked on"). observation_period = event_date month-end minus 1 month — the Silver REM snapshot key for joining to the owner / distress / valuation that existed before the permit. Materialize both in Gold so every downstream model joins to the same time anchor. See section 4b above for the diagram + edge cases.

Ops  Coverage report as first-class artifact

Today coverage is reconstructed by scripts/roofing/build_coverage_universe.py in another repo. Move the per-(FIPS, jurisdiction) coverage table into Gold or Platinum as a canonical output so consumers don't keep rebuilding it.

8b · Missing cleaning layer between Bronze and Gold PROPOSAL

None of this exists today. The current Silver job is structurally a join — it consolidates the four Bronze tables into a denormalized fact, but its cleaning work is minimal: uppercase CITY and STREET, cast COUNTY_FIPS to int, coalesce missing STATE to "00", keep the latest status. Everything else flows through to Gold unchanged. Result: Gold is still dirty — future dates survive, OCR typos survive, status case-variants survive, sentinel values survive, outlier JOB_VALUE survives.

Each downstream consumer ends up re-doing the same cleaning work locally — inconsistently. This is exactly the kind of repeated work a tier is supposed to eliminate.

What the cleaning layer would do

BRONZE (raw mirror, immutable) │ ▼ ┌────────────────────────────────────────────────────────────┐ │ CLEANING LAYER (new, between Bronze and Silver) │ │ │ │ · clip dates strictly to [1995-01-01, today + 90d] │ │ (today's clip 1900-2100 lets 2066 through) │ │ · normalize LATEST_STATUS to a finite enum │ │ (lowercase + dedupe synonyms: complete/final → │ │ COMPLETE, void/cancelled/canceled → CANCELED, …) │ │ · derive permit_category (ROOFING/HVAC/SOLAR/…) │ │ from PROJECT_TYPE array │ │ · flag value outliers (JOB_VALUE > p99.9, SQUARE_FEET │ │ > p99.9) as sidecar columns — do NOT cap silently │ │ · detect sentinel / constant columns and emit warning │ │ · normalize address: trim, collapse whitespace, drop │ │ OCR artefacts ("linsincludes" type fragments) │ │ · derive event_date + observation_period (see § 4b) │ │ │ └─────────────────────┬──────────────────────────────────────┘ │ ▼ SILVER (now: pure join + denormalization, all values clean) │ ▼ GOLD (now: pure enrichment, FA matching)

Two architectural choices

OptionShapeProsCons
Option A — new sub-tier Bronze → Bronze_Cleaned → Silver → Gold Bronze stays a literal immutable mirror (the gold rule of raw data lakes). Cleaning is a separate ownable artifact with its own QA gate. Easy to compare cleaned-vs-raw rows for audit. Another tier to maintain. Storage cost doubles for the cleaned mirror (although clustering can manage this).
Option B — harden Silver Bronze → Silver (now: clean + join) → Gold No new tier. Same path Silver already takes; just more work per row. Silver's job grows from "join" to "join + clean" — bigger function, harder to test in isolation. Audit trail of cleaning vs raw is lost (cleaned values overwrite originals).

Recommendation

Option A if storage is cheap (it is — S3 + Hudi compresses well). The clean separation between "raw mirror" and "cleaned for analytics" makes the contract auditable: every cleaning rule lives in one place, every cleaned row carries a provenance pointer back to the Bronze row it came from. Silver and Gold simplify in return.

This layer also becomes the natural home for the rename (issue #1), the permit_category enum (issue #2), the date sanitize fix (issue #3), the event_date + observation_period contract (section 4b), the status normalization (issue #5), the outlier flagging (issue #7). Today these are scattered across "Bronze sanitize is too loose," "Silver doesn't normalize," and "every consumer re-derives." A cleaning layer collapses them into one job with one contract.

8c · Integrate QA + observability with the matrix hub PROPOSAL

None of this exists today. The current pipeline's observability is self-contained: DynamoDB tables (buildzoom_etl_logs, buildzoom_qa_metrics) + Slack webhook + CloudWatch. To know "did anything fail this week?" or "what's the trend of completeness scores?" requires logging into DynamoDB and writing queries.

The requirement: every Bronze / Cleaning / Silver / Gold / Platinum QA run should write its result to the central matrix hub on Eritwini (clarify platform name) so observability lives where the rest of the company already looks — not in pipeline-specific logs that only DE can read.

What "matrix hub integration" means concretely

CurrentProposed
QA result lands in DynamoDB.buildzoom_qa_metrics Same row also pushed to the matrix hub via API / webhook
Failures alert Slack channel only Failures alert Slack and create an entry in the matrix-hub incident view (visible to non-DE stakeholders)
Run history lives in DynamoDB tables, only queryable via AWS console / scripts Run history surfaced as a matrix-hub dashboard: per-table, per-vintage, per-QA-rule. Time-series of completeness scores, duplicate %, null %, etc.
Each consumer (Callzeke dashboard, Apollo, new model) builds its own "did the upstream pipeline run cleanly?" check Single matrix-hub query answers "is upstream healthy?" for every consumer
Schema drift detection lives in QA code, surfaced as Slack alerts Schema drift surfaced as a matrix-hub dashboard tile with history

Why this matters (beyond convenience)

  • Cross-team visibility. DS and analytics teams shouldn't need DE help to verify upstream health. Today they do.
  • Audit trail. Compliance / external review can read the matrix-hub log without ever touching pipeline infrastructure.
  • Trend detection. Slack alerts catch acute failures; the matrix hub catches slow drifts (completeness score sliding from 88 → 82 → 76 over six weeks).
  • Single source of operational truth. Same pattern as for product KPIs and business metrics — pipeline health should live in the same place.

Open clarification

"Eritwini" — please confirm the exact platform name (suspecting transcription of an internal hub or a vendor like Airtable / Datadog / Monte Carlo / a homegrown tool). Once confirmed, the integration approach (REST API push, S3 sync, webhook, native connector) follows from the platform's supported ingestion methods.

9 · Where Platinum fits (and why) PROPOSAL

None of this exists today. The current repo has Bronze, Silver, Gold — no Platinum tier. Today every downstream model materializes its own feature snapshots and labels, divergently. This section describes what would close that gap.

Bronze raw vendor mirror Silver denormalized fact Gold FA-enriched, joinable Platinum (NEW) ML-ready labels + features Models roofing · HVAC · solar · …

What Platinum should produce

ArtifactGrainLives atBuilt from
Labels (per vertical) (property_id, T0) platinum/labels/<spec_v>/<gold_vintage>/<FIPS>/ Gold + a versioned SPEC.md defining "what counts as a positive"
Feature snapshot (shared across verticals) (property_id, T0, FIPS) platinum/features/<feature_set_v>/<T0>/<FIPS>/ Silver REM (as-of T0) + Gold permit history per category + macro
Training set (labels ⋈ features) (property_id, T0) platinum/training_sets/<run_id>/ One labels artifact + one feature artifact, frozen with a manifest
Coverage table (FIPS, jurisdiction, T0) platinum/coverage/<coverage_v>/<T0>/ Gold + Silver SFH universe

Why a separate tier (and not a richer Gold)

  • Gold serves many consumers. Adding model-specific as-of joins or label decisions to Gold would force every consumer to inherit them. Platinum is consumer-specific (per-vertical, per-spec_v).
  • Spec lives with the domain. Whether a permit "counts" depends on the model's question (REPLACEMENT-only roofing vs all-permits HVAC). That decision belongs with the data scientist, not the data engineer.
  • Reproducibility. A Platinum manifest pins (Gold vintage SHA, spec version, feature_set version, contract violations at build time, row counts per drop rule). Every training run is auditable from a single JSON.

Recommendation

Add a platinum/ directory to this ETL repo. Production Platinum ETL runs on the same EMR / Hudi infrastructure (no new vendor). The label / feature spec lives in the DS repo (where the domain knowledge is) and is referenced by version from the Platinum build. Same operational discipline as Bronze / Silver / Gold; new contract.

11 · Deep technical audit — 3 specialist passes

Methodology

3 independent agents read the full repo (8 205 LOC across 11 PySpark scripts + CloudFormation), each with a different lens: Performance & cost, data quality & correctness, code review. 90+ findings combined. The table below distills only the cross-cutting and HIGH-severity items — every claim cites file:line. Recommendations carry the PROPOSAL tag because no fix has been implemented yet.

11a · Critical correctness bugs (must-fix; silent data loss or wrong output)

SevBugFile:lineWhy it matters
HIGH Config.QA_THRESHOLDS attribute doesn't exist — every QA-class fallback path raises AttributeError shared/utils.py:840 (vs the real attribute Config.DEFAULT_QA_THRESHOLDS at :47) Today masked because every caller passes qa_thresholds=... explicitly. The day someone forgets, the entire Bronze QA stage crashes. Single-character bug fix.
HIGH Hudi options misspelled hudi.datasource.* instead of hoodie.* — silently ignored by Hudi shared/utils.py:573-580 · bronze/Build_Zoom_Permit_Classifications.py:48-53 Three schema-validation guards never run: allow.schema.mismatch, schema.validate, reconcile.schema. The team thinks schema is being validated; it's not. Vendor-side schema changes pass through invisibly.
HIGH Silver join on derived STATE silently drops permits and creates fan-outs silver/Build_Zoom_Silver_Master.py:151-158, 184-190 STATE = upper(split(PERMIT_JURISDICTION, "_")[0]). Any jurisdiction whose first underscore-token isn't a USPS code ("CHICAGO_IL", "NEWYORK_NY_MANHATTAN", formats without underscores) silently maps to a wrong state → join misses every property. Permits with null jurisdiction get STATE="00" and join to every STATE="00" property — unintentional cartesian explosion.
HIGH COUNTY_FIPS cast to int strips leading zeros — violates project hard-rule #1 bronze/Build_Zoom_Properties.py:54-63 · silver/Build_Zoom_Silver_Master.py:91 Alabama 010011001. Partitions write to PARTITION_FIPS=1001/; every external lookup (Census, FRED, BLS, FHFA, ACS) joins on the 5-char string and silently returns zero rows. Internal cross-references also break.
HIGH EFFECTIVE_DATE allowed NULL in composite Hudi record key — distinct rows silently collide bronze/Build_Zoom_Permit_Statuses.py:42-63 · shared/configs.py:114 RECORD_KEY = concat_ws("-", BUILDING_PERMIT_ID, PERMIT_STATUS, EFFECTIVE_DATE). concat_ws skips nulls — so two genuinely distinct status events with null date collapse to the same key and Hudi upsert silently overwrites. Per config comment ~30 % of statuses have null date.
HIGH Clustering health check reports row-count regression as warning, exits 2 (non-fatal) bronze/Hudi_Clustering_Maintenance.py:413-424 · silver/Hudi_Clustering_Maintenance.py:291-301 Pre-vs-post row count mismatch >1 % is logged as a warning with comment "might be OK if data was being written concurrently." It is not OK — that's exactly the data-loss / duplication scenario. Step Functions treats exit 2 as non-fatal; silent loss with a yellow icon.
HIGH DynamoDB execution log has no ConditionExpression — concurrent writes overwrite each other shared/utils.py:178-242, 297-339 Cursor (read_last_period_executed) computed by client-side sort over full history; writes are unconditional put_item keyed on (table_name, execution_timestamp_iso). Two jobs landing in the same second silently overwrite; two concurrent workers both decide "partition X is next" and both write SUCCESS — duplicate Hudi upserts plus a lying log.
HIGH Athena snapshot swap is non-atomic — old data deleted before new is confirmed queryable gold/Build_Zoom_Gold_Master.py:920-986 Three-step sequence: ALTER LOCATION → ALTER TBLPROPERTIES → delete old prefix. Between steps the table can return zero rows or scan dead paths. No smoke-test query before the delete. If property update fails, no rollback. If delete succeeds but downstream issue forces rollback, data is gone. Polling loop while True: sleep(1) is unbounded — stuck Athena hangs Gold forever.
HIGH Gold address matching has no key normalization — silently misses matches gold/Build_Zoom_Gold_Master.py:319-323, 347-356, 385-389 (1) "ST" vs "STREET" vs "AVE"/"AVENUE" — straight string compare, no USPS expansion. (2) Suite / unit / apt suffixes not normalized ("APT 4B""#4B""UNIT 4B"). (3) ZIP_CODE cast to Integer strips leading zeros — every New England state (MA, CT, RI, NJ, VT, ME, NH) loses ZIP matching. (4) APN format varies by county (LA: 1234-567-890 vs 1234567890; Cook: 12-34-567-890-1234) — no regexp_replace, Cook can lose 100 % of APN matches.
HIGH Gold aggregate_properties groupBy on all 60+ columns; coalesce kills multi-condition matches gold/Build_Zoom_Gold_Master.py:421-491 (1) Two roofing rows differing in even one nullable text field produce two Gold rows for the same BUILDING_PERMIT_ID instead of one with a 2-element array. (2) Code uses coalesce(FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3) before collect_list — so a permit matching FA "a" by std-address AND FA "b" by street name only keeps ["a"]. The docstring's promise "Multiple FA_PROPERTYIDs aggregated into array" is contradicted by the code. (3) collect_list not collect_set, order non-deterministic — consumers doing FA_PROPERTYID[0] get different answers per rebuild.
HIGH F.current_timestamp() as Hudi precombine key — re-runs are non-deterministic shared/utils.py:723 · all four Bronze drivers (Permit_Statuses:69, Properties:69, Permit_Classifications:121, Building_Permits via transform) Re-ingesting the same source file produces different precombine values; you can never reliably skip a duplicate or verify "is this the same row as last week". Worse: in the same Spark stage, executor-local clocks can resolve out of order, so a tie-break inside Hudi picks arbitrarily.
HIGH Silver latest-status window ties not broken — re-runs pick different "latest" silver/Build_Zoom_Silver_Master.py:118-129 Window.partitionBy("BUILDING_PERMIT_ID").orderBy(EFFECTIVE_DATE.desc()). Two statuses on the same day → row_number()=1 non-deterministic. Same Bronze → different Silver across runs.
HIGH PERMISSIVE CSV mode silently nullifies bad rows — no columnNameOfCorruptRecord sidecar bronze/Build_Zoom_Building_Permits.py:67-75 + analogous in all 4 Bronze readers A row that fails to parse becomes typed-nulls instead of being quarantined or logged. The QA's null-percentage gate may not trip if only one field is corrupt. No count of rejected rows surfaces anywhere.
HIGH BuildingPermitsQA mutates class attributes at runtime across 4 drivers bronze/Build_Zoom_Building_Permits.py:350-352 + 3 analogous (Properties, Permit_Statuses, Permit_Classifications) Each driver does qa_checker.CRITICAL_FIELDS = SomeConfig.CRITICAL_FIELDS on a single instance. Works only because each driver runs in a separate process. Any consolidation (test harness, notebook) bleeds state between tables. Class is also misnamed for one of the four tables it now serves.

11b · Performance & cost HIGH wins

Estimated achievable: 40-60 % wall-clock reduction + 30-50 % cost reduction on the Gold layer alone (the most expensive). Numbers are agent estimates; require real billing data to confirm.

FixMechanismImpact
Eliminate Gold's 16+ wide-shuffle .count() calls for logging gold/Build_Zoom_Gold_Master.py:151,182,226,271,341,373,402,440,445,451,452,488. Replace per-step counts with one aggregated .agg(F.count_when(...)) at the end; gate behind logger.isEnabledFor(DEBUG). 30-50 % Gold wall-clock
EMR cluster lifecycle — keep one cluster across Bronze + Silver + Gold Today three independent Step Functions each spin EMR up/down (~8 min × 3 = 24 min bootstrap per run). Single state-machine with sub-state-machines on one cluster ID. 24 min/run + 3× bootstrap cost
Add EMR managed scaling + spot TASK group All three CFNs hard-code 3 × r7i.4xlarge core, no managed scaling, no TASK group, no spot, gp2 EBS. Bronze tolerates spot churn (Hudi rollback). Switch EBS to gp3 (20 % cheaper, faster). 2-4× throughput/$
Don't re-read the entire Hudi target for QA on every partition bronze/Build_Zoom_Building_Permits.py:109 + analogous. QA's new-vs-updated computation does spark.read.format("hudi").load(hudi_path) with no filter — scans 483 M rows for classifications, >50 M for properties, on every partition. Project only the record-key column and filter by source's partition keys before the join. 50-80 % per-partition QA time
Replace 60-column groupBy with key-only aggregation + back-join gold/Build_Zoom_Gold_Master.py:480-484. matched_df.dropDuplicates(['BUILDING_PERMIT_ID']).drop('FA_PROPERTYID') then join the collect_list back on BUILDING_PERMIT_ID alone. Huge — 60-tuple shuffle eliminated
Switch bulk_insert mode from per-partition append to single union write gold/Build_Zoom_Gold_Master.py:612-660 + loop at :873-881. bulk_insert is intended as a single shot — 67 calls inside mode("append") repeatedly cycles Hudi metadata. Either union all partitions into one DataFrame before writing, or scale hoodie.bulkinsert.shuffle.parallelism dynamically by row count. 30-60 % Gold write time
Bump hoodie.cleaner.commits.retained from 1 to 3 shared/utils.py:586. Current value strips all rollback ability and works against COW write amplification. Recovery margin + fewer cleanup writes
Stop using df.limit(1) upsert to trigger clustering The "limit-1 upsert with inline clustering" trick rewrites real file groups on every "trigger" call (bronze/Hudi_Clustering_Maintenance.py:290, silver:237, gold/Run_Clustering.py:79). Use Hudi native CALL run_clustering(...) procedure; on failure, log and skip — never write phantom records. 20-40 % clustering cost
Single Spark session builder; eliminate 7× copy-pasted boilerplate Build_Zoom_Building_Permits.py:294-334 + 6 analogous. shuffle.partitions, coalescePartitions, executor sizing all drift between copies. Extract shared.spark.make_session(). Consistency + faster onboarding
DynamoDB query: add GSI or use Limit + ScanIndexForward=False shared/utils.py:198-216, 262-275, 369-394. Today every job start scans full history. O(N) DynamoDB RCU growth eliminated

11c · Silent-failure data-quality issues

SevFailure modeFile:line
HIGH"00" sentinel STATE creates fan-out matching (covered in 11a)silver/Build_Zoom_Silver_Master.py:151-158
HIGHPARTITION_YEAR=9999 sentinel for null INITIAL_STATUS_DATE rows is never in batch_partitions filter → Silver silently drops themshared/utils.py:716 · silver/Build_Zoom_Silver_Master.py:148
HIGHProperties schema_evolution=true + auto.evolution.column.type.override=STRING means a new vendor column lands as STRING regardless of true type — silent type coercionshared/utils.py:573-580
HIGHSalt_Bucket=0 → "SMALL_PARTITION" literal groups all small-FIPS into one bucket — cross-county APN collisions become possible (BuildZoom permit in FIPS A matches FA property in FIPS B by APN string)gold/Build_Zoom_Gold_Master.py:91-93, 354-356
HIGHClassifications collect_list not collect_set — duplicate PROJECT_TYPE rows in Bronze produce ["Roofing", "Roofing"] arrays in Silversilver/Build_Zoom_Silver_Master.py:67-69
MEDMulti-line DESCRIPTION with embedded newlines breaks CSV parsing (multiLine not set on 3 of 4 readers)bronze/Build_Zoom_Building_Permits.py:67-75 + 2 analogous
MEDextract_partitions_inside_folder picks first match by sort if vendor accidentally drops a backup file (e.g. .csv.bak) — no assertion that exactly 1 file matchesbronze/Build_Zoom_Building_Permits.py:419-422 + analogous
MEDNo referential integrity check: every BUILDING_PERMIT_ID in Permit_Statuses should exist in Building_Permits — orphan status rows silently inflate Silverno test exists
MEDCompleteness score is gamable — weighted average over all columns means adding always-full Hudi meta columns dilutes bad fieldsshared/utils.py:988-1009
MEDNo value-distribution drift check vs previous partition — a jurisdiction dropping 80 % of permits goes unflaggedshared/utils.py:1042-1060
MEDQA "initial-load" path treats df_target = None as initial — if the target read fails silently (permission denied), every record becomes "new" and you ship a full duplicateshared/utils.py:923-924
MEDSilver's "Real Estate Master period" is "freshest S3 prefix" — not deterministic across Gold partition batches if Real Estate publishes mid-Gold-rungold/Build_Zoom_Gold_Master.py:200-203

11d · Cross-cutting architectural concerns

1. shared/utils.py is a 2 853-line megamodule

It contains Config, schemas, S3 helpers, DynamoDB helpers, checkpoint logic, Hudi write helpers, transformations, Slack helpers, and three near-duplicate QA classes (BuildingPermitsQA, SilverMasterQA, GoldMasterQA — ~85 % identical, including a bug fix made in Silver/Gold but missing from Bronze around _hoodie_* column filtering). Every change has high blast radius and every test requires booting Spark + faking AWS. Split into shared/{spark,config,s3,dynamodb,hudi,qa,notifications}.py with a thin BronzePipeline / SilverPipeline / GoldPipeline base class to absorb the partition-loop boilerplate (currently copy-pasted 4×) and the Spark-session builder (currently copy-pasted 7×).

2. The execution model assumes serial Step Functions and no human re-runs

Nothing in code enforces "one writer per table". Protection is by convention — README says "ONLY when no ETL jobs are running"; CLAUDE.md says "don't edit mid-run". A senior shop would put a DynamoDB-backed lock around each (table_name, partition) and around the clustering job, with TTL and ownership records, and would make every put_item use ConditionExpression. Until that exists, the _V1 / _V2 versioning scheme is the only rollback you have — and snapshot deletion in Gold (issue 11a #8 above) erases it.

3. Idempotency is partial and silently asymmetric across layers

Bronze uses upsert with a non-deterministic timestamp precombine — safe but expensive on re-run. Silver runs full-refresh upsert — safe but rewrites the entire universe per run. Gold uses bulk_insert — fast but not idempotent if a single partition re-runs (no built-in dedup). Three layers behave differently under the same Step Functions retry policy. Standardize on one strategy per layer with documented retry semantics and surface "this partition has been written N times this snapshot" in DynamoDB so monitoring catches partial-success retries.

4. No CI / lint config — the pipeline ships to prod with no static analysis gate

No ruff, no mypy, no black, no requirements.txt visible. Dead imports, misspelled Hudi options, missing type hints, and dead args accumulate without anyone noticing. This is the cheapest single fix on the entire list — a 1-hour add-CI PR catches future regressions of every category in 11a / 11c.

How to act on this section

The 13 HIGH bugs in 11a are correctness issues — they produce wrong output today, not just future risk. The performance wins in 11b are agent estimates that need real billing data to confirm. The full agent transcripts (90+ findings) are too long to embed here; they live in the agent task outputs and can be re-derived by re-running the three specialist agents over the repo (Performance Benchmarker, Data Engineer, Code Reviewer).

12 · Anexo — independent verification of section 11 findings

Methodology

The 13 HIGH-severity bugs in §11a were originally surfaced by three specialist agents in a single pass. To avoid rubber-stamping, two independent verifier agents (Code Reviewer + Data Engineer) were dispatched separately, told not to trust the original wording, and asked to read the cited code fresh. They produced verdicts of CONFIRMED / PARTIAL / EXAGGERATED / REFUTED / REQUIRES-DATA. The "Consensus" column below reconciles their two views.

12a · Consensus table

#Finding (one-line)V1V2ConsensusWhat changes from §11a
F1Config.QA_THRESHOLDS attribute missingCONFIRMEDCONFIRMEDCONFIRMEDNone. Pure typo, latent because every caller passes qa_thresholds=.
F2Hudi options hudi.datasource.* silently ignoredCONFIRMEDCONFIRMEDCONFIRMEDRefined fix: not a simple prefix swap — 2 of 3 keys (allow.schema.mismatch, schema.validate) have no real hoodie.* equivalent. Author intent ("validate + reconcile") needs translation to actual Hudi keys: hoodie.datasource.write.reconcile.schema + hoodie.avro.schema.validate.
F3Silver join on derived STATE drops & fan-outsCONFIRMEDCONFIRMEDCONFIRMEDBoth failure modes empirically defensible. Magnitude data-dependent (depends on real PERMIT_JURISDICTION formats), but the "00" sentinel collision is structural — happens the first time any null arrives on either side.
F4COUNTY_FIPS cast to int strips leading zerosCONFIRMEDCONFIRMEDCONFIRMEDBronze actually handles nulls/non-numeric with a "00000" sentinel at :54-60 — and then casts to int at :63, throwing away the work. The author knew FIPS is a string then ignored it.
F5NULL EFFECTIVE_DATE in composite Hudi keyCONFIRMEDCONFIRMEDCONFIRMEDconcat_ws null-skip behavior verified in Spark 2.x / 3.x / 4.x. Per config comment ~30 % nulls → wide collision surface.
F6Clustering row-count drift >1% becomes warning + exit 2PARTIALPARTIAL / EXAGGERATEDPARTIALThe drift-to-warning downgrade is real. The "exit 2 is silent" claim depends on how the Step Functions caller treats non-zero exits. REQUIRES-DATA: confirm the State Machine doesn't special-case exit 2. Bigger architectural smell: the "concurrent writes" excuse implies clustering can race with ETL writers — that's the real bug.
F7DynamoDB execution log raceCONFIRMEDPARTIAL / EXAGGERATEDPARTIAL — refinedReal bug is different from original framing. The "two concurrent writes within the same second" is unlikely at ISO-microsecond precision. The genuine concern: read_last_period_executed does a full unpaginated DynamoDB Query, response is truncated to 1 MB by AWS — at ~10 K items the cursor silently misses older successes that may be the actual max-period. Fix: add pagination loop.
F8Athena snapshot swap non-atomic + delete unverifiedPARTIALCONFIRMEDCONFIRMED — refinedCode does check Athena query state == SUCCEEDED (V1's correction). What's still missing: a SELECT smoke-query against the new location after the ALTERs and before the delete. Polling loop is unbounded — stuck query hangs forever. Two-ALTER sequence is non-atomic.
F9Gold address matching has no key normalizationCONFIRMEDPARTIALPARTIAL — refinedStd-line + Std-zip ARE handled upstream by the Standardized_Addresses_* tables — Condition 1 in Gold consumes them correctly. Real remaining gaps: (a) APN normalization (raw PARCEL_NUMBER == APN, no regexp_replace, format varies per county), (b) ZIP_CODE cast to IntegerType drops leading zeros (whole New England loses ZIP joins). Condition 3 street normalization opaque — needs more investigation.
F10Gold coalesce + 60-col groupBy kills multi-condition matchesCONFIRMEDREFUTED-as-stated, real bug presentCONFIRMEDBoth verifiers converged: bug is real, mechanism is exactly what was claimed (coalesce runs BEFORE collect_list, kills multi-match). V2's "refuted-as-stated" was about minor framing wording, not the bug substance.
F11F.current_timestamp() as precombine non-deterministicCONFIRMEDPARTIAL / EXAGGERATEDPARTIAL — context-dependentUnder single-writer architecture (this pipeline today — Step Functions serializes), current_timestamp() is safe enough. The risk bites if (a) anyone re-runs a partition expecting idempotency, (b) two parallel jobs ever touch the same key (forbidden by convention but not by code). Status as a HIGH depends on whether the team wants re-run idempotency.
F12Silver latest-status window tie not brokenCONFIRMEDCONFIRMEDCONFIRMEDSpark window sort is provably not stable across the shuffle exchange. Reproducibility broken. Compounds with F5 (null-date status collisions) — a permit with 3 null-date statuses has 3 ties at the top → any can be "latest".
F13PERMISSIVE CSV silently nullifies bad cells, no audit trailCONFIRMEDCONFIRMEDCONFIRMEDProperties.py is worse than originally stated — it sets no mode at all, defaulting to PERMISSIVE implicitly (other three Bronze readers at least set it explicitly). Per-cell null (not whole-row null) — equally dangerous because it's silent.

12b · Final tally

8

CONFIRMED

F1 · F2 · F3 · F4 · F5 · F8 · F10 · F12 · F13

5

PARTIAL / REFINED

F6 · F7 · F9 · F11
(real bug, refined framing or context-dependent)

0

FULLY REFUTED

No finding survived as completely wrong.

12c · What, why, how — for each CONFIRMED finding

Compressed implementation-level summary. Numbers in parens are the original finding ID from §11a.

IDWHAT (the bug)WHY (consequence)HOW (one-line fix)
F1Config.QA_THRESHOLDS doesn't exist — fallback path crashesLatent AttributeError; today masked by every caller passing kwarg explicitlyshared/utils.py:840: Config.QA_THRESHOLDSConfig.DEFAULT_QA_THRESHOLDS
F23 schema-validation options use hudi. prefix instead of hoodie. — silently ignoredSchema-validation guards never run; opposite of intent ships to prodReplace with real keys: hoodie.datasource.write.reconcile.schema + hoodie.avro.schema.validate. Drop the third (no real equivalent).
F3Silver derives STATE via upper(split(JURISDICTION, "_")[0]) with "00" fallback; Properties does same with raw STATEPermits with non-USPS jurisdiction → lost from join. Null jurisdiction → cartesian fan-out against all "00" properties.Replace coalesce-to-sentinel with null-safe equals + explicit drop count; carry raw PROPERTY_STATE from Properties and join on that.
F4COUNTY_FIPS + PARTITION_FIPS cast to IntegerTypeAlabama 010011001. S3 partition path mangled. External FIPS lookups (Census, FRED, BLS, FHFA, ACS) return zero rows.Keep StringType with F.lpad(col.cast("string"), 5, "0") — no int cast at any layer.
F5concat_ws("-", BPID, STATUS, EFFECTIVE_DATE) as RECORD_KEY with EFFECTIVE_DATE allowed NULL — concat_ws skips nullsTwo distinct status events with null date collapse to same Hudi key; upsert silently overwrites; per config ~30% of statuses have null dateF.concat_ws("-", BPID, STATUS, F.coalesce(EFFECTIVE_DATE.cast("string"), F.lit("__NULL__")))
F6Clustering row-count drift >1% logged as warning + exit 2; "might be OK if concurrent writes" commentPossible data loss/duplication exits with non-zero code but Step Functions handler MAY treat as success. Concurrent-write excuse signals deeper coordination gap.Promote drift to hard failure (sys.exit(1)). Acquire write lock on table before clustering. Document exit-code contract.
F7read_last_period_executed does unpaginated DynamoDB Query → response truncated to 1 MB at ~10K items → may miss older successesCursor advancement is wrong at scale; jobs may re-process completed partitions or skip uncompleted onesAdd pagination loop with LastEvaluatedKey; better, replace with UpdateItem on a single "last_known_success" row with conditional expression.
F8Athena LOCATION swap = 2 ALTERs + delete-old, no SELECT smoke-test between, polling loop unboundedInconsistent window between ALTERs (zero-row queries possible); stuck Athena hangs forever; failed second ALTER leaves table broken + old data already goneSingle Glue catalog API update (transactional). SELECT COUNT(*) smoke-test after ALTERs, before delete. Bound polling with max_seconds=300. Soft-delete to quarantine/ with lifecycle rule.
F9Gold address matching: APN raw equality with no normalization, ZIP cast to int strips leading zerosAPN format varies by county (LA 1234-567-890 vs 1234567890; Cook 12-34-567-890-1234) → silent miss. New England ZIPs 070217021 (both sides same int cast, so they match each other — but downstream ZCTA joins via Census fail). Std-line + std-zip are handled upstream.regexp_replace(APN, "[^A-Z0-9]", "") on both sides + per-FIPS APN format profile. Cast ZIP to lpad(string, 5, "0"), not int.
F10coalesce(FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3) picks first non-null BEFORE collect_list; 60-col groupBy keeps spurious row splitsPermit matching FA-A by std-address AND FA-B by APN keeps only FA-A — promised multi-match arrays don't materialize. Docstring contradicts code.Replace coalesce with array_distinct(array(id1, id2, id3)) per row, then groupBy(BUILDING_PERMIT_ID).agg(array_distinct(flatten(collect_list(...)))). Drop the 60-col group.
F11F.current_timestamp() as Hudi precombine — value depends on processing time, not source dataRe-ingestion is not a no-op; re-run "wins" over original; concurrent writers raceFor Hudi precombine, use source-derived date (e.g. LATEST_STATUS_DATE) or driver-side lit(datetime.utcnow().isoformat()) materialized once per write. Safe under current single-writer model.
F12Latest-status window: orderBy(EFFECTIVE_DATE.desc()) with no tiebreaker → row_number()=1 non-deterministic on tiesSame input → different Silver across runs. Reproducibility broken. Compounds with F5: null-date ties at the top.Add a deterministic secondary key: .orderBy(EFFECTIVE_DATE.desc_nulls_last(), F.col("PERMIT_STATUS").desc(), F.col("ts").desc()) — or define a domain status priority.
F13PERMISSIVE CSV mode with typed schema nullifies bad cells per-cell; no columnNameOfCorruptRecord → bad rows discarded silently. Properties.py doesn't even set the mode (implicit PERMISSIVE).Vendor format drift produces silent nulls (e.g. JOB_VALUE = "N/A" → null), QA threshold max_null_pct_critical_fields = 1% may not trip on narrow driftsAdd .option("columnNameOfCorruptRecord", "_corrupt_record") + sidecar column; count and emit to QA metrics. Set explicit mode on Properties.

Final note on this verification pass

Zero findings were fully refuted. Eight are confirmed outright, five are confirmed-with-refined-context. The five PARTIALs do not mean "go ahead and ignore" — they mean the recommended fix or the severity rating shifts. Treat all 13 as real items in the engineering backlog; the verifier nuance is what informs prioritization, not whether to fix at all.

12d · Architectural logic audit — structural issues (separate from code bugs)

Why this section exists

The bugs in §11a are code-level — typos, race conditions, missing normalizers. They can be fixed without touching the architecture. A 4th independent agent (Software Architect) audited the logic of the pipeline itself — layer responsibilities, contract boundaries, naming, dependency graph, reproducibility model. Goal: find structural issues that mean "even if every bug were fixed, this design would still be wrong in important ways."

Top 5 structural issues

#IssueWhereSeverity
S1 Silver advertises a partition key (PARTITION_FIPS) that the Silver job never creates. Config declares PARTITION_KEY = "PARTITION_FIPS"; the Silver job batches by PARTITION_YEAR and produces a PARTITION_FIPS column only for permits whose property-join succeeds (when join fails → NULL FIPS → row silently disappears from any Gold filter on FIPS). Worse: every Silver batch touches every FIPS partition file — defeats partitioning as an idempotency boundary. The "zero-downtime full refresh" is a marketing label, not a property. shared/configs.py:270 + silver/Build_Zoom_Silver_Master.py:164-283 HIGH
S2 The STATE column shadows two different concepts and is the silent root of join brokenness. Properties: STATE = postal state from assessor. Permits: STATE = re-derived in Silver from upper(split(PERMIT_JURISDICTION, "_")[0]). Same column name, completely different meaning. Join treats them as equal — they are not. BuildZoom permit jurisdictions are city/county/agency identifiers whose prefix is sometimes a state code, sometimes a FIPS string, sometimes a city slug. Layer-responsibility leak: Silver re-derives a column with the same name to make a join work. silver/Build_Zoom_Silver_Master.py:94, 150-154, 186-190 HIGH
S3 Snapshot semantics are pinned to Bronze, but Gold output fuses 4 independently-versioned upstream tables. Gold builds snapshot suffix from get_last_bronze_date_snapshot() and writes Roofing_Master_Gold_V1_<YYYYMMDD>. But the join body reads (1) Silver Roofing_Master with no period filter, (2) Silver Real_Estate_Master filtered by S3-folder-listed last_period, (3-4) two Standardized_Addresses tables with no version pin. Three of these tables are not even built by this repo. One Gold "snapshot" stamps a Bronze date onto an output fused from 4 vintages; reproducibility from Gold row → source is broken at the moment of writing. Athena ALTER + delete-old is destructive — no rollback after the flip. gold/Build_Zoom_Gold_Master.py:734-772, 188-209 + shared/utils.py:245-282 HIGH
S4 Gold mixes data engineering with business decisions — the FA enrichment is a domain join, not a tier. The medallion definition would have Gold contain aggregated business-ready facts. What this Gold actually contains is a probabilistic entity-resolution result: 3-condition fuzzy address match, fall-through coalesce with implicit trust ordering, multiple FA_PROPERTYIDs collected into an array. The match policy is invisible to consumers (no per-row quality flag); tuning the match logic requires a full Gold rebuild (10⁸ rows); the 1:N permit→property fanout becomes an array column without a documented grain change. gold/Build_Zoom_Gold_Master.py:8-13, 277-469 MED-HIGH
S5 The Roofing_* naming is a domain lie that will block multi-vertical scaling. Bronze contains every BuildZoom permit category. The labels are aspirational ("for the 8020REI Roofing vertical"), the contents are generic. Not cosmetic: a future HVAC or solar vertical forces either (a) re-ingesting the same BuildZoom data into HVAC_* tables (1× storage + 1× Hudi maintenance per vertical) or (b) accepting "Roofing_Master" as the master for everything (in which case the name is a permanent lie). Current code locks in option (a) — Step Functions stack name, CFN, QA classes all encode "Roofing". Whole repo · configs.py:30, 69, 112, 152, 254, 325 MED

Other structural issues worth tracking

CategoryFinding
Layer leakProperties Bronze casts FIPS, derives partitions, generates DateSnapshot — that's Silver-cleanup work done in Bronze. Silver then re-casts COUNTY_FIPS again.
Layer leakSilver collapses Statuses (event-stream) to "current status per permit" via window function — that's a business aggregation that destroys history. Gold or a view should own it; Silver should expose the un-collapsed event timeline.
NamingAUX_PERMIT_STATUS / AUX_EFFECTIVE_STATUS_DATE are actually "latest status" — name tells the consumer nothing. There's already a LATEST_STATUS field on permits, creating two columns with overlapping meaning.
NamingS3_TABLE_PATH on RoofingGoldConfig points to a base path that never exists in S3 — real tables live at _<YYYYMMDD> snapshots. QA reads from the non-existent base path.
NamingHUDI_PARALLELISM constant is used as record_size in write_table_with_bulk_insert — name says parallelism, use says size estimate. They are not the same thing.
Dependency designGold has 3 hard dependencies on tables not built by this repo (Real_Estate_Master, two Standardized_Addresses tables). If FA changes its column name PropertyIdpropertyId, Gold silently breaks with empty matches. No upstream contract documented.
Dependency designBronze QA reads Bronze table state from run N−1 to compute deltas. Re-running breaks QA determinism.
IdempotencyThree layers, three different write policies (Bronze upsert, Silver full-refresh-upsert, Gold bulk_insert into new snapshot). There is no single "this end-to-end vintage is consistent" record. Asking "what's the latest fully-consistent (Bronze, Silver, Gold)?" requires reconciling 3 DynamoDB tables on incompatible date formats.
ReproducibilityNo source-file fingerprint stored on any row. A Gold row cannot be traced back to "this BuildZoom CSV, line N." DynamoDB records the partition (2025-12-10) but not the file ETag, byte count, or schema hash.
Schema contractSchema source of truth is a switch statement inside shared/utils.get_hudi_schema(...) — the 2853-line god module. No column-level contract: LATEST_STATUS "important" but no enum / no range / no allowed-value check.
Schema contractSchema evolution policies disagree: Hudi has auto-evolution enabled with type-override to STRING; the Spark reader uses a fixed schema and drops new columns at read. Two policies, opposite directions.
CouplingSilver reads Bronze via spark.read.format("hudi").load(BRONZE_PATH) with no commit pin. If Bronze for the active year is mid-commit (Hudi COW), Silver may read a half-finished file group. Step-Function serialization papers over the issue for daily runs but breaks under manual re-processing.
CouplingAthena DDL string is embedded inside Gold Spark job. Renaming the catalog or table requires editing Python code.
Modeling gapNo SCD treatment on Properties (assessor data, slowly-changing). Hudi upsert overwrites without effective_from/effective_to — historical analysis cannot answer "what did we know about this property on date X?"
Modeling gapNo event view of permits. Statuses is a natural event stream (applied → issued → inspected → completed) but Silver collapses to "latest only". The medallion pattern flattens an event source into a snapshot — the opposite of best practice for lifecycle analysis.
Cross-cuttingshared/utils.py at 2853 lines holds Spark schemas, Hudi writer, DynamoDB tracker, S3 lister, three QA classes, Slack formatter, snapshot resolver. A god module; refactoring it requires re-deploying every layer.
Cross-cuttingSalting config (salting_partitions.json) lives at s3://8020rei-staging-data-lake/diego_staging/ — in someone's personal staging path. Deployment landmine.
Hard-coded assumptionOne vendor (BuildZoom), one enrichment provider (First American, FA_PROPERTYID baked into Gold schema), one region (us-east-1 hard-coded in 3 files), one vertical, one environment. A second of any of these requires editing string literals in 5+ files.
Future-fragilityStep Functions chain is hard-coded serial Bronze → Silver → Gold. A second vertical needs a second state machine + second EMR cluster, or a fork. CFN doesn't support multi-tenancy.
Future-fragilitySnapshot deletion on next successful run means the first successful run after a bug ships overwrites the last-known-good. No rollback.

Architectural verdict

The medallion pattern is conceptually appropriate for this domain — permit data does benefit from a raw mirror, a denormalized fact, and an enriched serving layer. But the implementation has not internalized what each tier owes the next. Bronze does Silver work; Silver does Gold work and also re-does Bronze work it shouldn't trust; Gold does platform work (Athena DDL) and business decisions (match-method priority) that should be a thin policy layer above pure data. The contracts between layers are implicit — no manifest, no version pin, no schema enum — and the moment a downstream consumer wants to ask "is the data I'm reading consistent?" they cannot.

The structural ceiling is single-vendor, single-domain, single-region naming. The codebase is shaped like a one-shot data engineering project for BuildZoom-Roofing, not like a platform. Adding a second vendor doubles the surface area. Adding a second enrichment provider requires schema changes everywhere FA_PROPERTYID appears. Adding a second region requires editing string literals in 5+ files. The Roofing_* prefix, the hard-coded us-east-1, and the _V1 suffix together signal that the team is not yet thinking in terms of a long-lived platform.

A re-design starting today would: (1) name Bronze + Silver after vendor + entity (buildzoom.permits_raw, buildzoom.permits_facts), reserve domain language ("Roofing", "Solar") for Gold serving views; (2) make the precombine key a deterministic source_file_etag + row_number, never current_timestamp(), so re-runs are bit-identical; (3) split entity-resolution out of Gold into a "match candidates" intermediate with match_method + match_score columns, and let Gold be a thin policy view; (4) replace the VERSION_SUFFIX = "_V1" pattern with a snapshot/manifest table that records {bronze_commit, silver_commit, fa_period, std_addr_commit} per run, with retention policy and rollback; (5) move shared/utils.py into 3–4 cohesive modules and treat each layer's input/output schema as a versioned contract, not a Python switch statement; (6) make Step Functions parameterized over {vendor, domain} so a second vertical is a config change, not a fork.

The medallion isn't wrong. The implementation has just not invested in the contracts between tiers — and contracts, not technology, are what makes a lakehouse maintainable as it grows.

How to read structural vs code-level findings together

The 13 HIGH bugs in §11a are tactical fixes. The 5 structural issues in §12d are strategic re-architecture. Fixing the code bugs without addressing the structural ones leaves the same shape vulnerable to the same class of bugs later. Fixing structure without code-level fixes leaves shipping data wrong today. Do both; sequence the structural changes around the natural rebuild cycle (e.g., when the team next versions the tier from _V1_V2).

10 · Glossary & sources

TermMeaning
Hudi COWApache Hudi Copy-on-Write table. Each commit rewrites affected files; readers see consistent snapshots. Single-commit vintages = no version inflation (we verified 1.0× on Pinellas 12103).
Medallion architectureBronze (raw) → Silver (clean, joined) → Gold (enriched, joinable). Industry vocabulary (Databricks-coined).
FIPSFederal Information Processing Standards — 5-digit county identifier (always zero-padded). Used as partition key for geographic pruning.
FAFirst-American — the property-records vendor whose PropertyId Gold links permits to.
FA_PROPERTYIDArray of First-American property IDs matched to a permit via the 3-condition cascade.
FIPS saltingSpark anti-skew technique: hash-distribute big-FIPS rows across N sub-buckets so executors share load.
COMPOSITE_KEYBronze record key for classifications: BUILDING_PERMIT_ID|PROJECT_TYPE (one permit may carry multiple project types).
T0The prediction-time boundary. Features ≤ T0, labels > T0 + horizon. Not in this ETL today; introduced in the proposed Platinum tier.

Sources for this document