India Grid TAM Method and Accuracy
This report documents the revised scoring methodology, the correlation benchmark, and the leakage audit. The headline conclusion: the random-CV score is too optimistic because it rewards location memorization; the defensible accuracy figures are the purged spatial-block and city-holdout results.
Executive Read
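- Random CV (invalidated): coordinate + city alone reaches 0.856 Spearman, so this split is memorizing location.
- Purged spatial holdout, 3 km purge (Spearman 0.626): training rows within 3 km of held-out cells are removed.
- Purged spatial holdout, 5 km purge (Spearman 0.576): the stricter purge shows the current signal stack is still weak.
- City holdout (Spearman 0.536): unseen-city transfer remains weak.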
Method Flow
Production Scoring Formula
The revised score separates demand from serviceability and business priority. This avoids hiding operational weakness inside one TAM number.
gross_tam = households_est * residential_confidence * income_band_prob * activity_multiplier
serviceable_tam = gross_tam * serviceable_prob
acquirable_tam = serviceable_tam * acquirable_prob
priority_score = acquirable_tam * expected_conversion_rate * expected_margin * confidence_score * (1 + supply_gap_score) - service_cost_penalty
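
The formula chain above can be sketched in Python as follows. This is a minimal illustration, not the production implementation: the `CellInputs` container and `score_cell` function are hypothetical names, and the example values are invented for arithmetic clarity rather than taken from any real cell.

```python
from dataclasses import dataclass


@dataclass
class CellInputs:
    """Per-cell inputs named after the terms in the scoring formula."""
    households_est: float
    residential_confidence: float
    income_band_prob: float
    activity_multiplier: float
    serviceable_prob: float
    acquirable_prob: float
    expected_conversion_rate: float
    expected_margin: float
    confidence_score: float
    supply_gap_score: float
    service_cost_penalty: float


def score_cell(c: CellInputs) -> dict:
    """Compute the four staged quantities for one grid cell."""
    # Demand: how many in-band households the cell plausibly holds.
    gross_tam = (c.households_est * c.residential_confidence
                 * c.income_band_prob * c.activity_multiplier)
    # Serviceability and acquirability are applied as separate stages.
    serviceable_tam = gross_tam * c.serviceable_prob
    acquirable_tam = serviceable_tam * c.acquirable_prob
    # Business priority: the cost penalty is subtracted from the full product.
    priority_score = (acquirable_tam * c.expected_conversion_rate
                      * c.expected_margin * c.confidence_score
                      * (1 + c.supply_gap_score)
                      - c.service_cost_penalty)
    return {
        "gross_tam": gross_tam,
        "serviceable_tam": serviceable_tam,
        "acquirable_tam": acquirable_tam,
        "priority_score": priority_score,
    }


# Invented example values (assumption, for illustration only).
example = score_cell(CellInputs(
    households_est=1000.0, residential_confidence=0.8,
    income_band_prob=0.5, activity_multiplier=1.0,
    serviceable_prob=0.5, acquirable_prob=0.5,
    expected_conversion_rate=0.1, expected_margin=100.0,
    confidence_score=0.5, supply_gap_score=0.0,
    service_cost_penalty=50.0,
))
print(example)
```

Returning every intermediate stage, rather than only `priority_score`, keeps a weak `serviceable_prob` or `acquirable_prob` visible instead of buried in one number.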
Leakage Audit
| Test | Result | Interpretation |
|---|---|---|
| Direct target-column scan | Pass | No forbidden feature columns were found: no `vendor_tam`, `tam_0_10l`, GeoIQ rank, or neighbour-TAM feature is present in the model feature set. |
| Coordinate + city only, random CV | Spearman 0.856 | Location alone beats the full-feature random-CV score (0.834). This shows that random CV is measuring location memorization, not real feature quality. |
| Full current features, random CV | Spearman 0.834 | Removed from accuracy claims. It is retained only as evidence of leakage through spatially adjacent folds. |
| Purged spatial holdout, 0.10° blocks, 3 km purge | Spearman 0.626 | Training cells within 3 km of held-out cells are removed. This is a much stronger local generalization test. |
| Purged spatial holdout, 0.10° blocks, 5 km purge | Spearman 0.576 | The stricter purge shows current features are not enough for a robust 0.80 claim. |
| City holdout | Spearman 0.536 | The model does not transfer well to unseen cities; city-specific calibration or stronger causal features are required. |
| Open non-location features, city holdout | Spearman 0.467 | When the city and coordinate surface are removed, the current open features alone are weak. Population, building, road, and satellite-derived welfare features are needed. |
| Shuffled-label sanity check | Spearman -0.036 | The model collapses when labels are shuffled, so there is no evidence of accidental direct target leakage. |
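
The purged spatial holdout rows in the table above can be illustrated with a small stdlib-only sketch. It assumes that blocks are formed by flooring latitude and longitude to a 0.10° grid and that the purge uses great-circle (haversine) distance; the report states neither detail. The function names `block_id`, `haversine_km`, and `purged_split` are hypothetical, and the coordinates are invented.

```python
import math


def block_id(lat, lon, block_deg=0.10):
    """Assign a cell to a square lat/lon block (assumed blocking rule)."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere with mean Earth radius."""
    radius_km = 6371.0088
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def purged_split(cells, held_out_blocks, purge_km=3.0, block_deg=0.10):
    """Split cell indices into train, test, and purged lists.

    cells: list of (lat, lon) tuples.
    held_out_blocks: set of block ids used as the test fold.
    Cells outside the held-out blocks but within purge_km of any test
    cell are dropped from training (returned as purged).
    """
    test_idx = [i for i, (la, lo) in enumerate(cells)
                if block_id(la, lo, block_deg) in held_out_blocks]
    test_set = set(test_idx)
    train_idx, purged_idx = [], []
    for i, (la, lo) in enumerate(cells):
        if i in test_set:
            continue
        near_test = any(haversine_km(la, lo, *cells[j]) <= purge_km
                        for j in test_idx)
        (purged_idx if near_test else train_idx).append(i)
    return train_idx, test_idx, purged_idx


# Invented coordinates: five cells along one meridian.
cells = [(17.45, 78.45), (17.39, 78.45), (17.495, 78.45),
         (17.51, 78.45), (17.55, 78.45)]
held_out = {block_id(17.45, 78.45)}
train_idx, test_idx, purged_idx = purged_split(cells, held_out, purge_km=3.0)
print(train_idx, test_idx, purged_idx)
```

The pairwise distance loop is quadratic in the number of cells; a spatial index would be the natural replacement at production scale.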
Accuracy by Validation Design
| Validation design | Pearson r | Spearman r | Top 10% overlap | WMAPE | Use as claim? |
|---|---|---|---|---|---|
| City median baseline | 0.321 | 0.330 | 0.202 | 0.761 | Baseline only |
| Random 5-fold, full current features | 0.811 | 0.834 | 0.640 | 0.434 | Invalidated |
| Coordinate + city only, random 5-fold | 0.846 | 0.856 | 0.666 | 0.400 | Leakage proof |
| Spatial block 0.03° | 0.743 | 0.774 | 0.572 | 0.504 | Optimistic spatial |
| Purged spatial 0.10°, 3 km purge | 0.604 | 0.626 | 0.455 | 0.638 | Defensible |
| Purged spatial 0.10°, 5 km purge | 0.531 | 0.576 | 0.411 | 0.674 | Stricter |
| City holdout | 0.449 | 0.536 | 0.321 | 0.700 | Transfer test |
| Open non-location features, city holdout | 0.438 | 0.467 | 0.364 | 0.725 | No location crutch |
GeoHG-style Feature Builder
The helper script now creates a real local feature bundle inspired by GeoHG: area nodes, 8-neighbour grid edges, semantic entity hyperedges, POI hyperedges, and graph-context features. This is not yet the full PyTorch GeoHG GNN run; it is the auditable feature-building layer using the local data that actually exists in this repository.
- One feature variant uses semantic features, POIs, graph context, city, and position encodings.
- A second variant removes raw lon/lat, city grid row/col, radius, and city-size crutches.
- The graph has 8-neighbour area-area edges across the vendor grid cells.
- It also has 21,381 semantic entity edges plus 5,668 POI edges.
| Local source | Feature step | Status |
|---|---|---|
| CityMind-Lab/GeoHG | Method pattern: area nodes, entity hypernodes, POI hypernodes, spatial neighbours | Used as structure |
| india-geodata districts, nightlights, flood atlas, soil | District context and semantic entity features | Used |
| india-geodata education, police, airport, energy points | POI cell counts, 2 km counts, and POI-area hyperedges | Used |
| WorldCover raster | Landcover entity features | Missing locally |
| Hyderabad landuse polygons | Landuse entity features | Available but no overlap with vendor cities |
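
The 8-neighbour area-area edges described in this section can be sketched as below. This is an illustration under the assumption that each cell carries integer (row, col) grid indices within its city; it is not the code in `src/tam_geohg/graph_features.py`, and `eight_neighbour_edges` is a hypothetical name.

```python
def eight_neighbour_edges(cells):
    """Build undirected edges between grid cells that touch.

    cells: dict mapping cell_id -> (row, col) integer grid position.
    Two cells are linked when their row and col each differ by at
    most 1 (the 8 surrounding positions). Returns a sorted list of
    (id_a, id_b) pairs with id_a < id_b, each edge listed once.
    """
    by_pos = {pos: cid for cid, pos in cells.items()}
    edges = set()
    for cid, (r, c) in cells.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # skip the cell itself
                other = by_pos.get((r + dr, c + dc))
                if other is not None:
                    edges.add(tuple(sorted((cid, other))))
    return sorted(edges)


# A toy 3 x 3 grid with made-up cell ids.
grid = {f"r{r}c{c}": (r, c) for r in range(3) for c in range(3)}
edges = eight_neighbour_edges(grid)
centre_degree = sum(1 for e in edges if "r1c1" in e)
print(len(edges), centre_degree)
```

Cells missing from the grid simply contribute no edges, so irregular city footprints need no special handling.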
Repo Usage by Step
| Step | Repo / Source | Role | Status |
|---|---|---|---|
| Grid and joins | H3, kraina-ai/srai, yashveeeeeeer/india-geodata | Canonical H3 grid, current square grid compatibility, admin joins | Used for local joins |
| Admin backbone | ramSeraph/indian_admin_boundaries, SHRUG | State, district, village/town, and socioeconomic identifiers | Next |
| Household denominator | POPCORN, WorldPop, Census, SHRUG | People and households per cell | Next |
| Built environment | ramSeraph/indian_buildings | Building counts, built-up density, residential likelihood | Next |
| Welfare / affordability | EOML-for-India, Village Development Model, satimage | Satellite-to-living-standard, roof, lighting, water, development proxies | Next |
| Roads / POIs | srai, Hex2Vec, Highway2Vec, india-geodata POIs | Urban function, accessibility, commercial anchors | POI proxies used |
| Advanced inference | GeoHG | Heterogeneous spatial context after transparent baseline | Feature builder implemented; GNN later |
| Business validation | Internal leads, installs, serviceability, CAC, churn | Reality target; required before production deployment | Missing |
What to Claim
Do claim
- The new methodology is stronger than city-level baselines.
- Current non-neighbour features reach roughly 0.58 to 0.63 Spearman under purged spatial validation.
- The score architecture is defensible because it separates denominator, target-band fit, activity, serviceability, acquirability, and cost.
Do not claim yet
- Do not claim production-ready 0.80+ accuracy; random CV is invalidated by location memorization.
- Do not use neighbour TAM interpolation as a model.
- Do not claim India-wide generalization until city-holdout and internal outcome validation improve.
Next Accuracy Work
- Add population and household denominator features from WorldPop, Census, SHRUG, and POPCORN-style occupancy.
- Add building-footprint features from ramSeraph/indian_buildings.
- Add OSM road and POI embeddings through srai, Hex2Vec, and Highway2Vec.
- Add India-specific welfare features from EOML-for-India, Village Development Model, and satimage.
- Evaluate with purged spatial-block CV, city-holdout, time-holdout, and internal install/serviceability outcomes.
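
As a companion to the evaluation plan above, the sketch below shows one way to generate city-holdout folds. It assumes a leave-one-city-out design; the report does not say whether cities are held out one at a time or in groups. The function name and city labels are hypothetical.

```python
from collections import defaultdict


def city_holdout_splits(cities):
    """Yield (held_out_city, train_idx, test_idx), one fold per city.

    cities: list with the city label of each row, in row order.
    Every row of the held-out city goes to the test fold, so the
    model never sees that city during training.
    """
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for held_out in sorted(by_city):
        test_idx = by_city[held_out]
        train_idx = sorted(i for city, rows in by_city.items()
                           if city != held_out for i in rows)
        yield held_out, train_idx, test_idx


# Placeholder labels, not real vendor cities.
row_cities = ["city_a", "city_b", "city_a", "city_c"]
folds = list(city_holdout_splits(row_cities))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Any per-city calibration or target encoding would need to be fitted on the training rows of each fold only, otherwise the held-out city leaks back in.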
Source Files
Notebook: india_grid_tam_solution.ipynb
Method JSON: tam_scoring_v2_method.json
Strong leakage audit: leakage_audit_strong.json
Vendor-TAM supervised distillation: blocked as feature leakage; no model metrics or predictions are produced.
GeoHG helper script: scripts/build_geohg_features.py
GeoHG feature code: src/tam_geohg/graph_features.py
GeoHG feature metrics: geohg_feature_metrics.json
GeoHG feature manifest: feature_manifest.json