India Grid TAM Method and Accuracy

This report documents the revised scoring methodology, the correlation benchmark, and the leakage audit. The headline conclusion is simple: the random-CV score is too optimistic, and the honest measure of current accuracy is the purged spatial-block and city-holdout results.

Leakage verdict: random cross-validation is invalid for this notebook because location alone matches and slightly exceeds the full-feature score (0.856 versus 0.834 Spearman). The defensible view is purged spatial holdout and city holdout, not the 0.83 random-CV number.

Executive Read

Random CV status
Invalid

Coordinate + city alone reaches 0.856 Spearman, so under this split the model is memorizing location.

Purged spatial 3 km Spearman
0.626

Train rows within 3 km of held-out cells are removed.

Purged spatial 5 km Spearman
0.576

Stricter purge; shows current signal stack is still weak.

City-holdout Spearman
0.536

Unseen-city transfer remains weak.

Method Flow

GeoIQ Grid CSV: grid_id, city, TAM, geometry
Open India Layers: districts, nightlights, POIs, slums, future buildings/roads
Feature Frame: city geometry + open signals; no TAM column as feature
Production Score v2: households × residential × income fit × activity
Research TAM Distillation: train on vendor TAM inside CV; diagnostic only, not production
Leakage Audit: ablation + spatial CV
Accuracy Claim: use spatial/city holdouts

Production score must not use vendor TAM. Distillation uses vendor TAM only to diagnose recoverable correlation.
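The rule that the production score must not use vendor TAM can be enforced with a simple column-name scan before any model is fitted. The sketch below is illustrative only: `vendor_tam` and `tam_0_10l` are the column names mentioned in this report, while the substring patterns for GeoIQ rank and neighbour-TAM features are hypothetical and would need to match the real column naming.

```python
# Exact forbidden names are taken from the report; the substring patterns
# below are hypothetical placeholders for rank and neighbour-TAM columns.
FORBIDDEN_EXACT = {"vendor_tam", "tam_0_10l"}
FORBIDDEN_SUBSTRINGS = ("geoiq_rank", "neighbour_tam", "neighbor_tam")


def find_forbidden_columns(columns):
    """Return the feature columns that look like direct or indirect targets."""
    hits = []
    for name in columns:
        key = name.strip().lower()
        if key in FORBIDDEN_EXACT or any(s in key for s in FORBIDDEN_SUBSTRINGS):
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Illustrative feature list, not the notebook's real feature set.
    features = ["households_est", "Vendor_TAM", "mean_neighbour_tam_1km", "nightlight_mean"]
    print(find_forbidden_columns(features))
```

A name scan only catches obvious leaks; it does not replace the ablation and spatial-CV checks.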

Production Scoring Formula

The revised score separates demand from serviceability and business priority. This avoids hiding operational weakness inside one TAM number.

gross_tam =
  households_est
  * residential_confidence
  * income_band_prob
  * activity_multiplier

serviceable_tam =
  gross_tam * serviceable_prob

acquirable_tam =
  serviceable_tam * acquirable_prob

priority_score =
  acquirable_tam
  * expected_conversion_rate
  * expected_margin
  * confidence_score
  * (1 + supply_gap_score)
  - service_cost_penalty
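As a minimal sketch, the four formulas above translate directly into a per-cell function. The field names mirror the formula terms; the `CellInputs` container, the `score_cell` function name, and the example values are assumptions made for illustration, not the notebook's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class CellInputs:
    """Per-cell inputs; field names follow the terms of the scoring formula."""
    households_est: float
    residential_confidence: float
    income_band_prob: float
    activity_multiplier: float
    serviceable_prob: float
    acquirable_prob: float
    expected_conversion_rate: float
    expected_margin: float
    confidence_score: float
    supply_gap_score: float
    service_cost_penalty: float


def score_cell(c: CellInputs) -> dict:
    """Compute gross, serviceable, and acquirable TAM plus the priority score."""
    gross_tam = (c.households_est * c.residential_confidence
                 * c.income_band_prob * c.activity_multiplier)
    serviceable_tam = gross_tam * c.serviceable_prob
    acquirable_tam = serviceable_tam * c.acquirable_prob
    # The penalty is subtracted from the full product, as in the formula.
    priority_score = (acquirable_tam * c.expected_conversion_rate
                      * c.expected_margin * c.confidence_score
                      * (1 + c.supply_gap_score)
                      - c.service_cost_penalty)
    return {
        "gross_tam": gross_tam,
        "serviceable_tam": serviceable_tam,
        "acquirable_tam": acquirable_tam,
        "priority_score": priority_score,
    }


if __name__ == "__main__":
    # Made-up values chosen only to make the arithmetic easy to follow.
    example = CellInputs(1000, 0.8, 0.5, 1.0, 0.5, 0.5, 0.1, 100.0, 0.5, 0.2, 50.0)
    print(score_cell(example))
```

Returning every intermediate quantity, rather than only the priority score, keeps demand, serviceability, and business priority visible as separate numbers.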

Leakage Audit

| Test | Result | Interpretation |
| --- | --- | --- |
| Direct target-column scan | Pass | No forbidden feature columns were found: no `vendor_tam`, `tam_0_10l`, GeoIQ rank, or neighbour-TAM feature is present in the model feature set. |
| Coordinate + city only, random CV | Spearman 0.856 | This alone beats the full-feature random-CV score. That proves random CV is measuring location memorization, not real feature quality. |
| Full current features, random CV | Spearman 0.834 | Removed from accuracy claims. It is retained only as evidence of leakage through spatially adjacent folds. |
| Purged spatial holdout, 0.10° blocks, 3 km purge | Spearman 0.626 | Training cells within 3 km of held-out cells are removed. This is a much stronger local generalization test. |
| Purged spatial holdout, 0.10° blocks, 5 km purge | Spearman 0.576 | The stricter purge shows current features are not enough for a robust 0.80 claim. |
| City holdout | Spearman 0.536 | The model does not transfer well to unseen cities; city-specific calibration or stronger causal features are required. |
| Open non-location features, city holdout | Spearman 0.467 | When city and coordinate surface are removed, current open features alone are weak. We need population, buildings, roads, and satellite welfare. |
| Shuffled-label sanity check | Spearman -0.036 | The model collapses when labels are shuffled, so there is no evidence of accidental direct target leakage. |
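The purged spatial holdout can be sketched as follows: cells are grouped into 0.10° blocks, whole blocks are held out, and any training cell within the purge distance of a held-out cell is dropped. This is a sketch under stated assumptions, not the audit's actual code: the function names, the round-robin assignment of blocks to folds, and the brute-force haversine distance check are all choices made here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def block_id(lon, lat, block_deg=0.10):
    """Index of the square lon/lat block that contains the point."""
    return (math.floor(lon / block_deg), math.floor(lat / block_deg))


def purged_block_folds(points, block_deg=0.10, purge_km=3.0, n_folds=5):
    """Yield (train_idx, test_idx) pairs over a list of (lon, lat) centroids.

    Blocks are assigned to folds round-robin (an assumption). Training rows
    within purge_km of any held-out row are removed. Cost is O(n^2).
    """
    ids = [block_id(lon, lat, block_deg) for lon, lat in points]
    ordered = sorted(set(ids))
    for fold in range(n_folds):
        test_blocks = set(ordered[fold::n_folds])
        if not test_blocks:
            continue  # fewer blocks than folds
        test_idx = [i for i, b in enumerate(ids) if b in test_blocks]
        train_idx = []
        for i, b in enumerate(ids):
            if b in test_blocks:
                continue
            nearest = min(haversine_km(*points[i], *points[j]) for j in test_idx)
            if nearest > purge_km:
                train_idx.append(i)
        yield train_idx, test_idx


if __name__ == "__main__":
    # Illustrative centroids; the third sits about 1 km from the second.
    pts = [(77.55, 12.95), (77.595, 12.95), (77.605, 12.95), (77.75, 12.95)]
    for train, test in purged_block_folds(pts, n_folds=3):
        print("train", train, "test", test)
```

For a grid of realistic size, the pairwise distance loop would normally be replaced by a spatial index; the brute-force version is kept here because it is easy to audit.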

Accuracy by Validation Design

| Validation design | Pearson r | Spearman r | Top 10% overlap | WMAPE | Use as claim? |
| --- | --- | --- | --- | --- | --- |
| City median baseline | 0.321 | 0.330 | 0.202 | 0.761 | Baseline only |
| Random 5-fold, full current features | 0.811 | 0.834 | 0.640 | 0.434 | Invalidated |
| Coordinate + city only, random 5-fold | 0.846 | 0.856 | 0.666 | 0.400 | Leakage proof |
| Spatial block 0.03° | 0.743 | 0.774 | 0.572 | 0.504 | Optimistic spatial |
| Purged spatial 0.10°, 3 km purge | 0.604 | 0.626 | 0.455 | 0.638 | Defensible |
| Purged spatial 0.10°, 5 km purge | 0.531 | 0.576 | 0.411 | 0.674 | Stricter |
| City holdout | 0.449 | 0.536 | 0.321 | 0.700 | Transfer test |
| Open non-location features, city holdout | 0.438 | 0.467 | 0.364 | 0.725 | No location crutch |
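For reference, the four metrics in the table can be computed from held-out actual and predicted values as sketched below. The report does not define top 10% overlap or WMAPE, so the definitions used here are assumptions: overlap is the share of the actual top-decile cells that also fall in the predicted top decile, and WMAPE is the sum of absolute errors divided by the sum of absolute actual values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_fraction_overlap(actual, predicted, frac=0.10):
    """Share of the actual top-frac cells also in the predicted top-frac (assumed definition)."""
    k = max(1, int(len(actual) * frac))

    def top(values):
        return set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])

    return len(top(actual) & top(predicted)) / k


def wmape(actual, predicted):
    """Sum of absolute errors over sum of absolute actuals (assumed definition)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(abs(a) for a in actual)


if __name__ == "__main__":
    # Made-up values for illustration only.
    y_true = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    y_pred = [12, 18, 33, 39, 55, 58, 72, 79, 95, 90]
    print(pearson(y_true, y_pred), spearman(y_true, y_pred),
          top_fraction_overlap(y_true, y_pred), wmape(y_true, y_pred))
```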

GeoHG-style Feature Builder

The helper script now creates a real local feature bundle inspired by GeoHG: area nodes, 8-neighbour grid edges, semantic entity hyperedges, POI hyperedges, and graph-context features. This is not yet the full PyTorch GeoHG GNN run; it is the auditable feature-building layer using the local data that actually exists in this repository.

Full GeoHG-style spatial Spearman
0.737

Uses semantic features, POIs, graph context, city, and position encodings.

No-position semantic Spearman
0.682

Removes raw lon/lat, city grid row/col, radius, and city-size crutches.

Graph edges
24,198

8-neighbour area-area edges across the vendor grid cells.

Hyperedges
27,049

21,381 semantic entity edges plus 5,668 POI edges.
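The 8-neighbour area-area edges can be illustrated with a small sketch. It assumes each cell has an integer row and column within its city grid and that edges are undirected and never cross city boundaries; whether the 24,198 figure counts edges this way is not stated in the report, and the function name and input layout are hypothetical rather than taken from `src/tam_geohg/graph_features.py`.

```python
def eight_neighbour_edges(cells):
    """Build undirected 8-neighbour edges between grid cells.

    cells: dict mapping cell id -> (city, row, col), with integer row/col.
    Returns a sorted list of (id_a, id_b) pairs with id_a < id_b.
    """
    by_pos = {pos: cid for cid, pos in cells.items()}
    edges = set()
    for cid, (city, r, c) in cells.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # skip the cell itself
                other = by_pos.get((city, r + dr, c + dc))
                if other is not None:
                    edges.add(tuple(sorted((cid, other))))
    return sorted(edges)


if __name__ == "__main__":
    # Illustrative 3 x 3 grid in one hypothetical city.
    grid = {f"g{r}{c}": ("city_a", r, c) for r in range(3) for c in range(3)}
    print(len(eight_neighbour_edges(grid)))
```

Keying positions by city keeps cells from different cities unconnected even when their row and column indices coincide.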

| Local source | Feature step | Status |
| --- | --- | --- |
| CityMind-Lab/GeoHG | Method pattern: area nodes, entity hypernodes, POI hypernodes, spatial neighbours | Used as structure |
| india-geodata districts, nightlights, flood atlas, soil | District context and semantic entity features | Used |
| india-geodata education, police, airport, energy points | POI cell counts, 2 km counts, and POI-area hyperedges | Used |
| WorldCover raster | Landcover entity features | Missing locally |
| Hyderabad landuse polygons | Landuse entity features | Available but no overlap with vendor cities |

[Figure: GeoHG-style spatial-block actual versus predicted correlation scatter plot]

Repo Usage by Step

| Step | Repo / Source | Role | Status |
| --- | --- | --- | --- |
| Grid and joins | H3, kraina-ai/srai, yashveeeeeeer/india-geodata | Canonical H3 grid, current square grid compatibility, admin joins | Used for local joins |
| Admin backbone | ramSeraph/indian_admin_boundaries, SHRUG | State, district, village/town, and socioeconomic identifiers | Next |
| Household denominator | POPCORN, WorldPop, Census, SHRUG | People and households per cell | Next |
| Built environment | ramSeraph/indian_buildings | Building counts, built-up density, residential likelihood | Next |
| Welfare / affordability | EOML-for-India, Village Development Model, satimage | Satellite-to-living-standard, roof, lighting, water, development proxies | Next |
| Roads / POIs | srai, Hex2Vec, Highway2Vec, india-geodata POIs | Urban function, accessibility, commercial anchors | POI proxies used |
| Advanced inference | GeoHG | Heterogeneous spatial context after transparent baseline | Feature builder implemented; GNN later |
| Business validation | Internal leads, installs, serviceability, CAC, churn | Reality target; required before production deployment | Missing |

What to Claim

Do claim

  • The new methodology is stronger than city-level baselines.
  • Current non-neighbour features reach roughly 0.58 to 0.63 Spearman under purged spatial validation.
  • The score architecture is defensible because it separates denominator, target-band fit, activity, serviceability, acquirability, and cost.

Do not claim yet

  • Do not claim production-ready 0.80+ accuracy; random CV is invalidated by location memorization.
  • Do not use neighbour TAM interpolation as a model.
  • Do not claim India-wide generalization until city-holdout and internal outcome validation improve.

Next Accuracy Work

Source Files

Notebook: india_grid_tam_solution.ipynb
Method JSON: tam_scoring_v2_method.json
Strong leakage audit: leakage_audit_strong.json
Vendor-TAM supervised distillation: blocked as feature leakage; no model metrics or predictions are produced.
GeoHG helper script: scripts/build_geohg_features.py
GeoHG feature code: src/tam_geohg/graph_features.py
GeoHG feature metrics: geohg_feature_metrics.json
GeoHG feature manifest: feature_manifest.json