I moved the handlers to a dedicated package.
I added a publicly available token-creation endpoint with a 2-second rate limiter.
I moved all database interactions to a separate package. Each resource now has its own repository.
Refactoring in progress. I am moving database calls to a separate Go package.
I deployed the docs pages!
Additionally, I purchased a domain. The API is now available at api.worldlines.dev.
In a week, I’ll be leaving for a long holiday. Until then, I need to wrap up a few loose ends. I expect to finish a beta version of the GADM API before I go.
Remaining tasks:
- Buy a domain and link it to the server
- Create a static website for documentation
- Add an endpoint for access token creation
Today, I started working on the documentation site. I’ll be using Hugo and GitHub Pages — a simple and fast setup that fits the project’s needs for now.
The main goal of the refactoring has been achieved — all GADM endpoints now rely on the new Postgres tables. Along the way, I removed a significant amount of code tied to the old tables. The next step in improving the codebase will be to further separate database logic from the main package, making the structure cleaner and easier to maintain.
I continue working on leveraging the adm and adm_geometries tables for all endpoints.
I found and fixed the reverse geocode SQL query bug and deployed the fix.
I started refactoring the code. The first step is to extract the database “stuff” from the main package. I will probably take inspiration from the repository pattern from DDD.
I started working on leveraging the adm and adm_geometries tables for all existing endpoints. This is a required step before I dedupe the GADM information stored in the database.
The reverse geocode endpoint is deployed to production 🙂
I implemented the reverse geocode endpoint.
Once I deploy the new endpoint, I will take some time to refactor the code. I need to better organize the packages and get rid of duplicated GADM data.
Today I completed the setup for reverse geocoding. The structure is built around two tables:
- adm – contains all GADM levels, but without their geometries.
- adm_geometries – holds all unique GADM geometries.
The two tables are linked via a geometry hash. The original GADM properties are stored in the adm.metadata column as JSONB. The adm_geometries table also includes two extra columns, area_sq_m and bbox, to speed up spatial queries.
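For reference, this is roughly how the two tables look in SQL. Treat it as a sketch: the column names match the queries below, but the exact types and constraints are assumptions rather than the real DDL.

-- Sketch of the two-table layout described above; types and constraints are assumptions.
CREATE TABLE gadm.adm_geometries (
    geom_hash text PRIMARY KEY,                       -- md5 hash of the geometry binary
    geom      geometry(MultiPolygon, 4326) NOT NULL,  -- the unique GADM geometry
    bbox      geometry(Polygon, 4326) NOT NULL,       -- precomputed bounding box
    area_sq_m double precision NOT NULL               -- precomputed area, used for ordering
);

CREATE INDEX idx_adm_geometries_bbox ON gadm.adm_geometries USING gist (bbox);

CREATE TABLE gadm.adm (
    lv        int   NOT NULL,  -- GADM level (0 = country, 1 = state/region, ...)
    metadata  jsonb NOT NULL,  -- original GADM properties
    geom_hash text  NOT NULL REFERENCES gadm.adm_geometries (geom_hash)
);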
The Query
Reverse geocoding will be implemented using ST_Contains.
My initial query looked like this:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING, WAL)
WITH input_point AS (
    SELECT ST_SetSRID(ST_MakePoint(2.3216141804549526, 48.85511355252443), 4326)::geometry(Point,4326) AS pt
),
ag AS (
    SELECT g.geom_hash, g.area_sq_m
    FROM gadm.adm_geometries AS g
    CROSS JOIN input_point ip
    WHERE ST_Contains(g.geom, ip.pt)
    ORDER BY g.area_sq_m ASC
    LIMIT 1
)
SELECT
    a.lv,
    a.metadata,
    a.geom_hash,
    ag.area_sq_m
FROM gadm.adm AS a
JOIN ag ON a.geom_hash = ag.geom_hash
ORDER BY a.lv DESC
LIMIT 1;
After running EXPLAIN ANALYZE, I found the query took about 170 ms, which is far too slow. The bottleneck was ST_Contains being called on geometries from all GADM levels. I had expected a point to match only three to five boundaries, depending on the number of levels in a given country. However, the analysis revealed that for my test point in Paris, ST_Contains was executed 15 times.
I needed a way to reduce the number of ST_Contains calls. But first, I wanted to measure how performance varied across different GADM levels. My assumption was that the country level would be the slowest, with each subsequent, smaller level executing faster.
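A per-level timing can be taken by restricting the containment check to a single GADM level. Something along these lines gives the idea; it is a sketch, not the exact query I ran:

-- Sketch: time ST_Contains for a single GADM level (lv = 0 is the country level).
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.lv, g.geom_hash, g.area_sq_m
FROM gadm.adm AS a
JOIN gadm.adm_geometries AS g ON g.geom_hash = a.geom_hash
WHERE a.lv = 0
  AND ST_Contains(
        g.geom,
        ST_SetSRID(ST_MakePoint(2.3216141804549526, 48.85511355252443), 4326)
      );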
The assumption proved correct — the country-level check was the slowest, at around 150 ms. Surprisingly, all levels below the country performed similarly, taking only about 2.5–5 ms each – not too bad!
This confirmed my initial plan: first, find all geometries whose bounding box contains the point. Then, order them by area and select the smallest one whose geometry actually contains the point.
WITH input_point AS (
    SELECT ST_SetSRID(ST_MakePoint(2.3216141804549526, 48.85511355252443), 4326)::geometry(Point,4326) AS pt
),
candidates AS (
    SELECT g.geom_hash, g.geom, g.area_sq_m
    FROM gadm.adm_geometries AS g
    CROSS JOIN input_point ip
    WHERE ST_Contains(g.bbox, ip.pt)
    ORDER BY g.area_sq_m ASC
)
SELECT c.geom_hash, c.area_sq_m
FROM candidates c
CROSS JOIN input_point ip
WHERE ST_Contains(c.geom, ip.pt)
ORDER BY c.area_sq_m ASC
LIMIT 1;
This change cut the execution time down to 5–10 ms, which is an acceptable result for the initial implementation. It’s a >10x improvement!
Limit (cost=2548.72..7448.94 rows=1 width=33) (actual time=3.869..3.930 rows=1 loops=1)
  Output: g.geom_hash, g.area_sq_m
  Buffers: shared hit=69
  CTE input_point
    ->  Result (cost=0.00..0.01 rows=1 width=32) (actual time=0.088..0.092 rows=1 loops=1)
          Output: '0101000020E6100000F4DE9774AA920240BA3A635C746D4840'::geometry(Point,4326)
  ->  Nested Loop (cost=2548.71..7448.93 rows=1 width=33) (actual time=3.769..3.771 rows=1 loops=1)
        Output: g.geom_hash, g.area_sq_m
        Join Filter: st_contains(g.geom, ip.pt)
        Rows Removed by Join Filter: 2
        Buffers: shared hit=69
        ->  Sort (cost=2548.71..2549.69 rows=391 width=11234) (actual time=3.432..3.434 rows=3 loops=1)
              Output: g.geom_hash, g.geom, g.area_sq_m
              Sort Key: g.area_sq_m
              Sort Method: quicksort Memory: 32kB
              Buffers: shared hit=69
              ->  Nested Loop (cost=0.28..648.88 rows=391 width=11234) (actual time=2.291..2.674 rows=15 loops=1)
                    Output: g.geom_hash, g.geom, g.area_sq_m
                    Buffers: shared hit=66
                    ->  CTE Scan on input_point ip_1 (cost=0.00..0.02 rows=1 width=32) (actual time=0.412..0.413 rows=1 loops=1)
                          Output: ip_1.pt
                    ->  Index Scan using idx_adm_geometries_bbox on gadm.adm_geometries g (cost=0.28..648.47 rows=39 width=11354) (actual time=1.690..2.069 rows=15 loops=1)
                          Output: g.geom_hash, g.geom, g.bbox, g.area_sq_m
                          Index Cond: (g.bbox ~ ip_1.pt)
                          Filter: st_contains(g.bbox, ip_1.pt)
                          Buffers: shared hit=66
        ->  CTE Scan on input_point ip (cost=0.00..0.02 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=3)
              Output: ip.pt
Planning:
  Buffers: shared hit=250
Planning Time: 17.373 ms
Execution Time: 5.370 ms
(32 rows)
Edge cases
ST_Contains returns false for points lying exactly on a boundary. This is an issue I’ll need to address later. For now, I’m keeping the implementation simple by ensuring the query returns at most one result.
If I wanted to include boundary points, I’d need to use ST_Covers. However, this requires a different optimization approach, since the query could return multiple jurisdictions, for example where three countries meet.
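For the record, a boundary-inclusive variant could look roughly like the sketch below. It drops the LIMIT 1 on purpose, so it may return several jurisdictions:

-- Sketch: ST_Covers counts boundary points as matches, so all covering
-- jurisdictions are returned instead of a single smallest one.
WITH input_point AS (
    SELECT ST_SetSRID(ST_MakePoint(2.3216141804549526, 48.85511355252443), 4326) AS pt
)
SELECT g.geom_hash, g.area_sq_m
FROM gadm.adm_geometries AS g
CROSS JOIN input_point ip
WHERE ST_Covers(g.bbox, ip.pt)  -- bbox prefilter, now also boundary-inclusive
  AND ST_Covers(g.geom, ip.pt)  -- exact check, true even on the boundary
ORDER BY g.area_sq_m ASC;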
Other Optimization Ideas
Further improvements are possible by subdividing geometries into chunks of up to 256 points using the dedicated function ST_Subdivide.
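One way this could look is a precomputed table of subdivided pieces keyed by the geometry hash. The table name and exact shape below are only a sketch, not part of the current schema:

-- Sketch: precompute small pieces so ST_Contains runs on polygons with few vertices.
CREATE TABLE gadm.adm_geometries_subdivided AS
SELECT g.geom_hash,
       ST_Subdivide(g.geom, 256) AS geom  -- pieces with at most 256 vertices each
FROM gadm.adm_geometries AS g;

CREATE INDEX idx_adm_geometries_subdivided_geom
    ON gadm.adm_geometries_subdivided USING gist (geom);

The reverse geocode query would then test the point against these small pieces and join back to adm via geom_hash.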
I analyzed geometry duplication across GADM levels and found over 9,000 duplicate geometries. This duplication often occurs in small countries, such as the Vatican, or in islands. To verify whether geometries were truly duplicated, I compared their hashes, point counts, and areas — an approach that should eliminate hash collisions.
-- =======================================================================
-- QUERY 3: Check for potential MD5 collisions by comparing geometry equality
-- =======================================================================
WITH all_geometries AS (
    SELECT
        md5_geom_binary_hash as hash,
        'adm_0' as table_name,
        geom,
        gid_0 as gid,
        country as name
    FROM adm_0
    UNION ALL
    SELECT
        md5_geom_binary_hash as hash,
        'adm_1' as table_name,
        geom,
        gid_1 as gid,
        name_1 as name
    FROM adm_1
    UNION ALL
    SELECT
        md5_geom_binary_hash as hash,
        'adm_2' as table_name,
        geom,
        gid_2 as gid,
        name_2 as name
    FROM adm_2
    UNION ALL
    SELECT
        md5_geom_binary_hash as hash,
        'adm_3' as table_name,
        geom,
        gid_3 as gid,
        name_3 as name
    FROM adm_3
    UNION ALL
    SELECT
        md5_geom_binary_hash as hash,
        'adm_4' as table_name,
        geom,
        gid_4 as gid,
        name_4 as name
    FROM adm_4
    UNION ALL
    SELECT
        md5_geom_binary_hash as hash,
        'adm_5' as table_name,
        geom,
        gid_5 as gid,
        name_5 as name
    FROM adm_5
    -- Add other tables as needed for comprehensive check
),
hash_groups AS (
    SELECT
        hash,
        COUNT(*) as count,
        array_agg(DISTINCT table_name) as tables,
        array_agg(name) as names
    FROM all_geometries
    GROUP BY hash
    HAVING COUNT(*) > 1
),
collision_check AS (
    SELECT
        hg.hash,
        hg.count,
        hg.tables,
        hg.names,
        -- Check if all geometries with same hash are actually equal
        COUNT(DISTINCT ST_AsText(ag.geom)) as unique_geometries_count
    FROM hash_groups hg
    JOIN all_geometries ag ON hg.hash = ag.hash
    GROUP BY hg.hash, hg.count, hg.tables, hg.names
)
SELECT
    hash,
    count as duplicate_count,
    tables,
    names,
    unique_geometries_count,
    CASE
        WHEN unique_geometries_count = 1 THEN 'TRUE DUPLICATE'
        ELSE 'POTENTIAL MD5 COLLISION!'
    END as analysis
FROM collision_check
ORDER BY unique_geometries_count DESC, duplicate_count DESC;
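The point-count and area cross-check can be done in the same spirit. Here is a minimal sketch for levels 0 and 1 only; the same pattern extends to the other adm_<nr> tables:

-- Sketch: for hashes shared between two levels, confirm the geometries also agree
-- on vertex count and area, which should rule out md5 collisions in practice.
SELECT
    a0.md5_geom_binary_hash                                    AS hash,
    ST_NPoints(a0.geom) = ST_NPoints(a1.geom)                  AS same_point_count,
    ST_Area(a0.geom::geography) = ST_Area(a1.geom::geography)  AS same_area
FROM adm_0 AS a0
JOIN adm_1 AS a1 ON a1.md5_geom_binary_hash = a0.md5_geom_binary_hash;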
An unexpected benefit of this situation is that I can store all geometries in one place, remove duplicates, and save space. One option is to maintain a single geometries table with a one-to-many relationship to the original GADM tables. However, this makes it harder to map back from adm_geometries to the lowest-level adm_<nr> tables.
My next experiment is to merge all adm_ tables into a single table, with metadata stored in a JSON/JSONB column:
table: adm
- lv: int
- metadata: JSON
- geometry_hash // md5 hash
table: adm_geometries
- geometry_hash: text // md5 hash
- geom: GEOMETRY(POLYGON)
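A rough sketch of how one level could be folded into that layout, assuming geometry_hash is the primary key of adm_geometries, metadata ends up as jsonb, and the original tables keep their md5_geom_binary_hash column. The same pair of INSERTs would be repeated for each adm_<nr> table:

-- Sketch: fold adm_2 into the merged layout; geometries shared with other levels
-- are deduplicated by the ON CONFLICT clause.
INSERT INTO adm_geometries (geometry_hash, geom)
SELECT DISTINCT ON (md5_geom_binary_hash) md5_geom_binary_hash, geom
FROM adm_2
ON CONFLICT (geometry_hash) DO NOTHING;

INSERT INTO adm (lv, metadata, geometry_hash)
SELECT 2,
       to_jsonb(a) - 'geom' - 'md5_geom_binary_hash',  -- original properties, minus the geometry columns
       a.md5_geom_binary_hash
FROM adm_2 AS a;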
That’s it for today!
I started working on the reverse geocoding endpoint. I decided I will store the geometries in a single table, adm_geometries. This should help optimize the st_within query across GADM levels. The id of a geometry will be calculated as the md5 hash of its shape.
Table "gadm.adm_geometries"
Column | Type | Collation | Nullable | Default
--------+-----------------------------+-----------+----------+---------
id | text | | not null |
geom | geometry(MultiPolygon,4326) | | not null |
Indexes:
"adm_geometries_pkey" PRIMARY KEY, btree (id)
"idx_adm_geometries_geom" gist (geom)
However, I encountered a problem: there are duplicated md5 hashes across GADM levels. This is unexpected! Does it mean that certain states fully overlap with their countries?
I will figure it out tomorrow … hopefully
I polished and refined the Docker Compose files. Additionally, I tested that the “up” migrations work as intended, and I added the capability to run “down” migrations.
Hooray!
I finally deployed the first version of my GADM API service. It is very rough at the moment. The deployment process is semi-automated: I can initialize a new Ubuntu VM, launch the dockerized server and database, ingest the GADM geopackage, and run migrations with just a few manual interventions.
Some bash scripts that I wrote will help me automate the process even further with a GitHub CI/CD pipeline.
At the moment, I am happy that I can ping my server from my local machine and confirm that pagination and filtering work as intended for all GADM levels!