India Grid TAM Method and Accuracy

This report documents the revised scoring methodology, the correlation benchmark, and the leakage audit. The headline conclusion: the random-CV score is too optimistic, and the honest measure of current accuracy is the purged spatial-block and city-holdout results.

Leakage verdict: random cross-validation is invalid for this notebook because location alone matches, and slightly exceeds, the full-feature score (0.856 versus 0.834 Spearman). The defensible view is purged spatial holdout and city holdout, not the 0.83 random-CV number.

Executive Read

Random CV status

Invalid

Coordinate + city alone reaches 0.856 Spearman, so under this split the model is memorizing location.

Purged spatial 3 km Spearman

0.626

Train rows within 3 km of held-out cells are removed.

Purged spatial 5 km Spearman

0.576

Stricter purge; shows the current signal stack is still weak.

City-holdout Spearman

0.536

Unseen-city transfer remains weak.

Method Flow

Production Scoring Formula

The revised score separates demand from serviceability and business priority. This avoids hiding operational weakness inside one TAM number.

gross_tam =
  households_est
  * residential_confidence
  * income_band_prob
  * activity_multiplier

serviceable_tam =
  gross_tam * serviceable_prob

acquirable_tam =
  serviceable_tam * acquirable_prob

priority_score =
  acquirable_tam
  * expected_conversion_rate
  * expected_margin
  * confidence_score
  * (1 + supply_gap_score)
  - service_cost_penalty
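
The formula above can be read as a small chain of multiplications followed by one subtraction. The following is a minimal Python sketch of that chain, not the notebook's implementation: the `CellInputs` container and the `score_cell` function are hypothetical names introduced here, and only the field names come from the formula itself.

```python
from dataclasses import dataclass


@dataclass
class CellInputs:
    """Per-cell inputs; field names mirror the terms in the formula above."""
    households_est: float
    residential_confidence: float
    income_band_prob: float
    activity_multiplier: float
    serviceable_prob: float
    acquirable_prob: float
    expected_conversion_rate: float
    expected_margin: float
    confidence_score: float
    supply_gap_score: float
    service_cost_penalty: float


def score_cell(c: CellInputs) -> dict:
    """Hypothetical helper: evaluates the four stages in order."""
    gross_tam = (
        c.households_est
        * c.residential_confidence
        * c.income_band_prob
        * c.activity_multiplier
    )
    serviceable_tam = gross_tam * c.serviceable_prob
    acquirable_tam = serviceable_tam * c.acquirable_prob
    # The penalty is subtracted once, after the full product is formed.
    priority_score = (
        acquirable_tam
        * c.expected_conversion_rate
        * c.expected_margin
        * c.confidence_score
        * (1 + c.supply_gap_score)
        - c.service_cost_penalty
    )
    return {
        "gross_tam": gross_tam,
        "serviceable_tam": serviceable_tam,
        "acquirable_tam": acquirable_tam,
        "priority_score": priority_score,
    }


# Illustrative inputs only; these numbers are not from the report.
example = score_cell(CellInputs(
    households_est=1000, residential_confidence=0.8, income_band_prob=0.5,
    activity_multiplier=1.0, serviceable_prob=0.5, acquirable_prob=0.5,
    expected_conversion_rate=0.1, expected_margin=200.0, confidence_score=0.5,
    supply_gap_score=0.0, service_cost_penalty=100.0,
))
```

Returning every intermediate stage keeps each factor visible, which matches the stated goal of not hiding operational weakness inside one TAM number.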

Leakage Audit

Test	Result	Interpretation
Direct target-column scan	Pass	No forbidden feature columns were found: no `vendor_tam`, `tam_0_10l`, GeoIQ rank, or neighbour-TAM feature is present in the model feature set.
Coordinate + city only, random CV	Spearman 0.856	This alone beats the full-feature random-CV score. That proves random CV is measuring location memorization, not real feature quality.
Full current features, random CV	Spearman 0.834	Removed from accuracy claims. It is retained only as evidence of leakage through spatially adjacent folds.
Purged spatial holdout, 0.10° blocks, 3 km purge	Spearman 0.626	Training cells within 3 km of held-out cells are removed. This is a much stronger local generalization test.
Purged spatial holdout, 0.10° blocks, 5 km purge	Spearman 0.576	The stricter purge shows current features are not enough for a robust 0.80 claim.
City holdout	Spearman 0.536	The model does not transfer well to unseen cities; city-specific calibration or stronger causal features are required.
Open non-location features, city holdout	Spearman 0.467	When the city and coordinate features are removed, the current open features alone are weak. Population, building, road, and satellite welfare features are needed.
Shuffled-label sanity check	Spearman -0.036	The model collapses when labels are shuffled, so there is no evidence of accidental direct target leakage.
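
The purged spatial holdout rows above can be sketched as follows. This is an illustration under stated assumptions, not the audit's code: it assumes blocks are formed by flooring latitude and longitude to a 0.10° grid and that purge distance is great-circle (haversine) distance. The names `block_id` and `purged_block_split` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def block_id(lat, lon, block_deg=0.10):
    """Assumed blocking rule: floor each coordinate to a block_deg grid."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))


def purged_block_split(cells, held_out_blocks, block_deg=0.10, purge_km=3.0):
    """cells: list of (lat, lon). Returns (train_idx, test_idx).

    Test cells are those inside a held-out block. A remaining cell is kept
    for training only if it is farther than purge_km from every test cell.
    """
    test_idx = [
        i for i, (lat, lon) in enumerate(cells)
        if block_id(lat, lon, block_deg) in held_out_blocks
    ]
    test_set = set(test_idx)
    train_idx = []
    for i, (lat, lon) in enumerate(cells):
        if i in test_set:
            continue
        too_close = any(
            haversine_km(lat, lon, *cells[j]) <= purge_km for j in test_idx
        )
        if not too_close:
            train_idx.append(i)
    return train_idx, test_idx


# Four illustrative cells on one meridian, roughly 2.2 km apart.
cells = [(17.41, 78.45), (17.39, 78.45), (17.37, 78.45), (17.35, 78.45)]
train_3km, test_3km = purged_block_split(cells, {(174, 784)}, purge_km=3.0)
train_5km, test_5km = purged_block_split(cells, {(174, 784)}, purge_km=5.0)
```

The pairwise loop is quadratic in the number of cells; a spatial index would be the usual replacement on a large grid. Widening the purge from 3 km to 5 km drops one more training cell in this toy example, which is the same mechanism that lowers the score between the two purged rows.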

Accuracy by Validation Design

Validation design	Pearson r	Spearman r	Top 10% overlap	WMAPE	Use as claim?
City median baseline	0.321	0.330	0.202	0.761	Baseline only
Random 5-fold, full current features	0.811	0.834	0.640	0.434	Invalidated
Coordinate + city only, random 5-fold	0.846	0.856	0.666	0.400	Leakage proof
Spatial block 0.03°	0.743	0.774	0.572	0.504	Optimistic spatial
Purged spatial 0.10°, 3 km purge	0.604	0.626	0.455	0.638	Defensible
Purged spatial 0.10°, 5 km purge	0.531	0.576	0.411	0.674	Stricter
City holdout	0.449	0.536	0.321	0.700	Transfer test
Open non-location features, city holdout	0.438	0.467	0.364	0.725	No location crutch
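
For readers who want to reproduce the columns above, here is a minimal sketch of the four metrics in plain Python. The report does not spell out its exact definitions, so two are assumptions: top-10% overlap is taken as the share of the actual top decile that also appears in the predicted top decile, and WMAPE as the sum of absolute errors divided by the sum of absolute actual values. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_fraction_overlap(actual, predicted, fraction=0.10):
    """Assumed definition: |top-k actual ∩ top-k predicted| / k."""
    k = max(1, int(len(actual) * fraction))

    def top(v):
        return set(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])

    return len(top(actual) & top(predicted)) / k


def wmape(actual, predicted):
    """Assumed definition: sum of absolute errors over sum of |actual|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(abs(a) for a in actual)


# Illustrative values only; not data from the report.
actual = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
predicted = [12, 18, 33, 41, 49, 65, 68, 85, 100, 95]
rho = spearman(actual, predicted)
overlap = top_fraction_overlap(actual, predicted)
error = wmape(actual, predicted)
```

In this toy data the two largest cells swap places, so rank correlation stays high while the top-decile overlap falls to zero. That is why the table reports overlap alongside correlation.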

GeoHG-style Feature Builder

The helper script now creates a real local feature bundle inspired by GeoHG: area nodes, 8-neighbour grid edges, semantic entity hyperedges, POI hyperedges, and graph-context features. This is not yet the full PyTorch GeoHG GNN run; it is the auditable feature-building layer using the local data that actually exists in this repository.

Full GeoHG-style spatial Spearman

0.737

Uses semantic features, POIs, graph context, city, and position encodings.

No-position semantic Spearman

0.682

Removes raw lon/lat, city grid row/col, radius, and city-size crutches.

Graph edges

24,198

8-neighbour area-area edges across the vendor grid cells.

Hyperedges

27,049

21,381 semantic entity edges plus 5,668 POI edges.
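
The 8-neighbour area-area edges can be illustrated with a short sketch. It assumes each vendor grid cell is identified by an integer (row, col) pair and that edges are counted once per undirected pair. The report states neither, and `eight_neighbour_edges` is a hypothetical name rather than a function from `src/tam_geohg/graph_features.py`.

```python
def eight_neighbour_edges(cells):
    """Return undirected edges between cells that touch by side or corner.

    cells: iterable of (row, col) integer pairs for occupied grid cells.
    Only four of the eight offsets are scanned, so each pair is emitted once.
    """
    cell_set = set(cells)
    forward_offsets = [(0, 1), (1, -1), (1, 0), (1, 1)]
    edges = []
    for row, col in sorted(cell_set):
        for d_row, d_col in forward_offsets:
            neighbour = (row + d_row, col + d_col)
            if neighbour in cell_set:
                edges.append(((row, col), neighbour))
    return edges


# A 3x3 patch of occupied cells as a toy example.
patch = [(r, c) for r in range(3) for c in range(3)]
patch_edges = eight_neighbour_edges(patch)
```

A full 3x3 patch yields 12 side edges and 8 diagonal edges. Cells at the boundary of a city grid have fewer neighbours, so the total edge count depends on the grid's shape as well as its cell count.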

Local source	Feature step	Status
CityMind-Lab/GeoHG	Method pattern: area nodes, entity hypernodes, POI hypernodes, spatial neighbours	Used as structure
india-geodata districts, nightlights, flood atlas, soil	District context and semantic entity features	Used
india-geodata education, police, airport, energy points	POI cell counts, 2 km counts, and POI-area hyperedges	Used
WorldCover raster	Landcover entity features	Missing locally
Hyderabad landuse polygons	Landuse entity features	Available but no overlap with vendor cities

Figure: GeoHG-style spatial-block scatter plot of actual versus predicted values.

Repo Usage by Step

Step	Repo / Source	Role	Status
Grid and joins	H3, kraina-ai/srai, yashveeeeeeer/india-geodata	Canonical H3 grid, current square grid compatibility, admin joins	Used for local joins
Admin backbone	ramSeraph/indian_admin_boundaries, SHRUG	State, district, village/town, and socioeconomic identifiers	Next
Household denominator	POPCORN, WorldPop, Census, SHRUG	People and households per cell	Next
Built environment	ramSeraph/indian_buildings	Building counts, built-up density, residential likelihood	Next
Welfare / affordability	EOML-for-India, Village Development Model, satimage	Satellite-to-living-standard, roof, lighting, water, development proxies	Next
Roads / POIs	srai, Hex2Vec, Highway2Vec, india-geodata POIs	Urban function, accessibility, commercial anchors	POI proxies used
Advanced inference	GeoHG	Heterogeneous spatial context after transparent baseline	Feature builder implemented; GNN later
Business validation	Internal leads, installs, serviceability, CAC, churn	Reality target; required before production deployment	Missing

What to Claim

Do claim

The new methodology beats the city median baseline (Spearman 0.330) under every validation design tested, including city holdout (0.536).
Current non-neighbour features reach roughly 0.58 to 0.63 Spearman under purged spatial validation.
The score architecture is defensible because it separates denominator, target-band fit, activity, serviceability, acquirability, and cost.

Do not claim yet

Do not claim production-ready 0.80+ accuracy; random CV is invalidated by location memorization.
Do not use neighbour TAM interpolation as a model.
Do not claim India-wide generalization until city-holdout results improve and internal outcome validation is available.

Next Accuracy Work

Add population and household denominator features from WorldPop, Census, SHRUG, and POPCORN-style occupancy.
Add building-footprint features from ramSeraph/indian_buildings.
Add OSM road and POI embeddings through srai, Hex2Vec, and Highway2Vec.
Add India-specific welfare features from EOML-for-India, Village Development Model, and satimage.
Evaluate with purged spatial-block CV, city-holdout, time-holdout, and internal install/serviceability outcomes.

Source Files

Notebook: india_grid_tam_solution.ipynb
Method JSON: tam_scoring_v2_method.json
Strong leakage audit: leakage_audit_strong.json
Vendor-TAM supervised distillation: blocked as feature leakage; no model metrics or predictions are produced.
GeoHG helper script: scripts/build_geohg_features.py
GeoHG feature code: src/tam_geohg/graph_features.py
GeoHG feature metrics: geohg_feature_metrics.json
GeoHG feature manifest: feature_manifest.json