Methodology

The Market Opportunity Engine joins three federal datasets (Census ACS, County Business Patterns, and TIGER/Line) and produces a transparent opportunity score per ZIP code (ZCTA) per business type. There are no black-box weights: every input and subscore is visible in the rankings table.

Data sources

  • Census ACS 5-year estimates — population, age, income, housing, household composition.
  • Census County Business Patterns (CBP) ZIP detail — establishment counts by NAICS industry code at the ZIP level.
  • Census County Business Patterns (CBP) county — same, at county level. Used as a fallback when ZIP-level data is suppressed.
  • Census TIGER/Line — geographic boundaries and ZCTA→county crosswalk.
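
As a rough illustration of how these sources could be combined for one business type, here is a minimal sketch. The function name, the row layout, and the field names (zcta, zip, county, naics, establishments) are hypothetical and are not taken from the engine or from the Census file formats. It also assumes that a CBP ZIP code can be matched directly to the ZCTA of the same number.

```python
from collections import defaultdict

def join_sources(acs_rows, cbp_zip_rows, cbp_county_rows, crosswalk, naics_codes):
    """Build one record per ZCTA for a single business type.

    All field names are illustrative assumptions, not real Census column names.
    """
    codes = set(naics_codes)

    # Sum establishments over the curated NAICS codes, per ZIP.
    zip_counts = defaultdict(int)
    for row in cbp_zip_rows:
        if row["naics"] in codes:
            zip_counts[row["zip"]] += row["establishments"]

    # Same aggregation at the county level, kept for the fallback.
    county_counts = defaultdict(int)
    for row in cbp_county_rows:
        if row["naics"] in codes:
            county_counts[row["county"]] += row["establishments"]

    records = {}
    for row in acs_rows:
        zcta = row["zcta"]
        county = crosswalk.get(zcta)  # ZCTA -> county, e.g. derived from TIGER/Line
        records[zcta] = {
            **row,
            "county": county,
            # None means no matching ZIP-level rows (missing or suppressed).
            "observed_count": zip_counts.get(zcta),
            "county_count": county_counts.get(county),
        }
    return records
```

Keeping observed_count as None rather than 0 when no ZIP-level row matches preserves the distinction between "no data" and "zero establishments", which matters for the county fallback described later.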

Scoring formula

Demand (per ZCTA)

  • pop_score — log-scaled population, 0 below 1,000, 1 at 50,000+.
  • income_fit — 0 below the business type's income floor, 1 at its ideal income, linear in between.
  • age_fit — tent function peaking at the business type's ideal age, dropping to 0 at ±tolerance years.
  • demand_score = 0.4·pop_score + 0.4·income_fit + 0.2·age_fit
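
The demand subscores above can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the exact log interpolation between 1,000 and 50,000 is assumed, income_fit is assumed to stay at 1 above the ideal income (the text does not say what happens there), and the parameter names floor, ideal, ideal_age, and tolerance are made up for the example.

```python
import math

def clamp01(x: float) -> float:
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))

def pop_score(population: float, lo: float = 1_000, hi: float = 50_000) -> float:
    # Assumption: linear interpolation in log space between lo and hi.
    if population <= lo:
        return 0.0
    return clamp01((math.log(population) - math.log(lo)) / (math.log(hi) - math.log(lo)))

def income_fit(income: float, floor: float, ideal: float) -> float:
    # 0 at or below the floor, linear ramp up to 1 at the ideal income.
    # Assumption: the score stays at 1 above the ideal income.
    if ideal <= floor:
        raise ValueError("ideal income must exceed the income floor")
    return clamp01((income - floor) / (ideal - floor))

def age_fit(median_age: float, ideal_age: float, tolerance: float) -> float:
    # Tent function: 1 at ideal_age, falling linearly to 0 at +/- tolerance years.
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    return max(0.0, 1.0 - abs(median_age - ideal_age) / tolerance)

def demand_score(pop: float, income: float, age: float) -> float:
    # Fixed weights from the formula above: 0.4 / 0.4 / 0.2.
    return 0.4 * pop + 0.4 * income + 0.2 * age
```

Because the three weights sum to 1 and each subscore lies in [0, 1], demand_score also lies in [0, 1].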

Supply (per ZCTA)

  • observed_count — sum of establishments matching this business type's NAICS codes from CBP ZIP detail.
  • county_density_per_10k — county-wide establishments per 10,000 residents (used when ZIP data is missing).
  • estimated_per_10k — final density used for scoring: observed when available, county fallback otherwise.
  • supply_source — observed, county_fallback, or no_data.
  • supply_score — percentile rank of estimated_per_10k across populated ZCTAs (0 = least competition, 1 = most).
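
The supply fields above could be computed along these lines. This is a sketch, not the engine's implementation. It assumes that a missing ZIP-level count is represented as None, that ZCTAs with no density are left out of the ranking, and that the percentile rank is the share of ranked ZCTAs with a strictly lower density, scaled so the lowest gets 0 and the highest gets 1. The text does not specify the tie or no_data handling, so both are guesses.

```python
from bisect import bisect_left
from typing import Optional

def estimate_density(
    population: float,
    observed_count: Optional[int],
    county_density_per_10k: Optional[float],
) -> tuple[Optional[float], str]:
    """Return (estimated_per_10k, supply_source) for one populated ZCTA."""
    if population <= 0:
        raise ValueError("only populated ZCTAs are scored")
    if observed_count is not None:
        # ZIP-level count is available: convert it to a rate per 10,000 residents.
        return observed_count * 10_000 / population, "observed"
    if county_density_per_10k is not None:
        # ZIP-level data is missing: borrow the county-wide rate.
        return county_density_per_10k, "county_fallback"
    return None, "no_data"

def supply_scores(densities: list[Optional[float]]) -> list[Optional[float]]:
    """Percentile rank of each density (0 = least competition, 1 = most)."""
    known = sorted(d for d in densities if d is not None)
    n = len(known)
    scores: list[Optional[float]] = []
    for d in densities:
        if d is None:
            scores.append(None)  # assumption: no_data ZCTAs are not ranked
        elif n == 1:
            scores.append(0.0)
        else:
            # Count of strictly lower densities, scaled to [0, 1]; ties share a rank.
            scores.append(bisect_left(known, d) / (n - 1))
    return scores
```

For example, a ZCTA of 20,000 residents with 4 observed establishments gets a density of 2 per 10,000 with supply_source observed, while the same ZCTA with no ZIP-level count and a county rate of 3 gets 3 with supply_source county_fallback.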

Composite

opportunity_score = demand_score − 0.5·supply_score
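
Since demand_score and supply_score both lie in [0, 1], opportunity_score ranges from −0.5 (no demand, maximum competition) to 1 (full demand, least competition). A minimal sketch of the composite and the resulting ranking follows. The ZCTA labels A, B, and C and their subscores are made-up illustration values, not engine output.

```python
def opportunity_score(demand_score: float, supply_score: float) -> float:
    # Competition is penalized at half the weight of demand.
    return demand_score - 0.5 * supply_score

# Hypothetical (demand_score, supply_score) pairs for three ZCTAs.
subscores = {
    "A": (0.75, 0.25),  # strong demand, little competition
    "B": (0.75, 1.00),  # same demand, heaviest competition
    "C": (0.50, 0.00),  # moderate demand, least competition
}

ranked = sorted(
    ((zcta, opportunity_score(d, s)) for zcta, (d, s) in subscores.items()),
    key=lambda item: item[1],
    reverse=True,
)
# ranked == [("A", 0.625), ("C", 0.5), ("B", 0.25)]
```

Note that C outranks B despite lower demand: B's competition penalty of 0.5 outweighs its 0.25 demand advantage.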

Why county fallback exists

CBP suppresses entire NAICS rows at the ZIP level when revealing them would expose individual businesses. For narrow industry codes, this creates massive false-zero gaps: we observed ~80% of ZCTAs reporting zero gyms when the true share is closer to 50%. County-level CBP has far less suppression because counties are larger and individual businesses are harder to identify, so we substitute the county's per-capita rate when ZIP-level data is missing.

Known limitations

  • No service-area awareness. Adjacent ZCTAs share customers but the engine treats each in isolation.
  • National supply baseline. supply_score ranks every ZCTA against a single national distribution, so dense urban ZCTAs are compared directly with rural ones and local context is not fully accounted for.
  • NAICS coverage. Some businesses span multiple NAICS codes; the engine matches a curated list per type, not every possible code.
  • Median age is coarse. Daycare scoring would benefit from under-5 population brackets, not median age. (Coming.)
  • No housing/family weighting yet. HVAC scoring should weight housing age; daycare should weight households-with-kids.