2025/08/09 [GADM-api]

Today I completed the setup for reverse geocoding. The structure is built around two tables:
adm – contains all GADM levels, but without their geometries.
adm_geometries – holds all unique GADM geometries.

The two tables are linked via a geometry hash. Original GADM properties are stored in the adm.metadata column as JSONB. The adm_geometries table also includes extra columns — area_sq_m and bbox — to speed up spatial queries.

The Query

Reverse geocoding will be implemented using ST_Contains.

My initial query looked like this:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING, WAL) WITH input_point AS (
  SELECT ST_SetSRID(ST_MakePoint(2.3216141804549526, 48.85511355252443), 4326)::geometry(Point,4326) AS pt
),
ag AS (
  SELECT g.geom_hash, g.area_sq_m
  FROM gadm.adm_geometries AS g
  CROSS JOIN input_point ip
  WHERE ST_Contains(g.geom, ip.pt)
  ORDER BY g.area_sq_m ASC
  LIMIT 1
)
SELECT
  a.lv,
  a.metadata,
  a.geom_hash,
  ag.area_sq_m
FROM gadm.adm AS a
JOIN ag ON a.geom_hash = ag.geom_hash
ORDER BY a.lv DESC
LIMIT 1;

After running EXPLAIN ANALYZE, I found the query took about 170 ms — far too slow. The bottleneck was ST_Contains being called on geometries from all GADM levels. I had expected a point to match only three to five boundaries, depending on the number of levels in a given country. However, the analysis revealed that for my test point in Paris, ST_Contains was executed 15 times.

I needed a way to reduce the number of ST_Contains calls. But first, I wanted to measure how performance varied across different GADM levels. My assumption was that the country level would be the slowest, with each subsequent, smaller level executing faster.

The assumption proved correct — the country-level check was the slowest, at around 150 ms. Surprisingly, all levels below the country performed similarly, taking only about 2.5–5 ms each – not too bad!

This confirmed my initial plan: first, find all geometries whose bounding box contains the point. Then, order them by area and select the smallest one whose geometry actually contains the point.

WITH input_point AS (
  SELECT ST_SetSRID(ST_MakePoint(2.3216141804549526, 48.85511355252443), 4326)::geometry(Point,4326) AS pt
),
candidates AS (
  SELECT g.geom_hash, g.geom, g.area_sq_m
  FROM gadm.adm_geometries AS g
  CROSS JOIN input_point ip
  WHERE ST_Contains(g.bbox, ip.pt)          
  ORDER BY g.area_sq_m ASC
)
SELECT c.geom_hash, c.area_sq_m
FROM candidates c
CROSS JOIN input_point ip
WHERE ST_Contains(c.geom, ip.pt)
ORDER BY c.area_sq_m ASC
LIMIT 1;

This change cut the execution time down to 5–10 ms, which is an acceptable result for the initial implementation. It’s a >10x improvement!

 Limit  (cost=2548.72..7448.94 rows=1 width=33) (actual time=3.869..3.930 rows=1 loops=1)
   Output: g.geom_hash, g.area_sq_m
   Buffers: shared hit=69
   CTE input_point
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.088..0.092 rows=1 loops=1)
           Output: '0101000020E6100000F4DE9774AA920240BA3A635C746D4840'::geometry(Point,4326)
   ->  Nested Loop  (cost=2548.71..7448.93 rows=1 width=33) (actual time=3.769..3.771 rows=1 loops=1)
         Output: g.geom_hash, g.area_sq_m
         Join Filter: st_contains(g.geom, ip.pt)
         Rows Removed by Join Filter: 2
         Buffers: shared hit=69
         ->  Sort  (cost=2548.71..2549.69 rows=391 width=11234) (actual time=3.432..3.434 rows=3 loops=1)
               Output: g.geom_hash, g.geom, g.area_sq_m
               Sort Key: g.area_sq_m
               Sort Method: quicksort  Memory: 32kB
               Buffers: shared hit=69
               ->  Nested Loop  (cost=0.28..648.88 rows=391 width=11234) (actual time=2.291..2.674 rows=15 loops=1)
                     Output: g.geom_hash, g.geom, g.area_sq_m
                     Buffers: shared hit=66
                     ->  CTE Scan on input_point ip_1  (cost=0.00..0.02 rows=1 width=32) (actual time=0.412..0.413 rows=1 loops=1)
                           Output: ip_1.pt
                     ->  Index Scan using idx_adm_geometries_bbox on gadm.adm_geometries g  (cost=0.28..648.47 rows=39 width=11354) (actual time=1.690..2.069 rows=15 loops=1)
                           Output: g.geom_hash, g.geom, g.bbox, g.area_sq_m
                           Index Cond: (g.bbox ~ ip_1.pt)
                           Filter: st_contains(g.bbox, ip_1.pt)
                           Buffers: shared hit=66
         ->  CTE Scan on input_point ip  (cost=0.00..0.02 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=3)
               Output: ip.pt
 Planning:
   Buffers: shared hit=250
 Planning Time: 17.373 ms
 Execution Time: 5.370 ms
(32 rows)

Edge cases

ST_Contains returns false for points lying exactly on a boundary. This is an issue I’ll need to address later. For now, I’m keeping the implementation simple by ensuring the query returns at most one result.

If I wanted to include boundary points, I’d need to use ST_Covers. However, this requires a different optimization approach, since the query could return multiple jurisdictions—for example, where three countries meet.

Other Optimization Ideas

Further improvements are possible by subdividing geometries into chunks of up to 256 points using the dedicated function ST_Subdivide.

log: 8/09/2025