Batch Geocoding with PostGIS

So you’ve got a table of addresses that need to be matched up with their latitude/longitude coordinates… i.e. you need to geocode some addresses.  Most of the time, an online tool is a better choice than running your own geocoder if all you need is a handful of addresses geolocated.  In the case that you can’t share your addresses or you have too many to send through an online tool, batch geocoding on your own computer is a good option.  In this post I’ll share how I’ve been doing this.

First, you’ll need a geocoder.  If you’ve got a PostGIS Geocoder up an running, congratulations!  That’s no easy feat.  If you don’t, take a detour over to my post about how to make a PostGIS geocoder in less than 5 days.

Once you’ve made your geocoder, you’re probably fairly familiar with how to run queries in pgAdmin 4, but just to be on the safe side, I’m gonna give you all the details I think you’ll need.

Start pgAdmin 4.  (If you get an error, you may need to start the postgresql service – see link above for details).  Navigate in the browser panel on the left side of the window to your geocoder database and click on it to highlight it.  Open the Query Tool – Tools menu at the top, then select Query Tool.

This is the code I’ve been running (with copious code comments to help you… and me later… understand what I did).  You’ll need to make adjustments for your specific table of addresses, but after that, you can paste the whole thing into the Query Tool and hit the run button (why is it lightning bolt icon?).

--- remove any existing attempts
DROP TABLE IF EXISTS address_table;

--- Create an empty table
--- Make the columns match those of the CSV that has the data 
--- you want to geocode

CREATE TABLE address_table(
per_id_fk varchar(255), 
acity varchar(255), 
astate varchar(255),
azip varchar(255),
acounty varchar(255),
adate varchar(255),
street varchar(255),
input_address varchar(255));

--- Look at your new empty table:   
--- (I like to have visual confirmation that things worked the first time 
--- I run them... you don't need to do this when you run the whole script 
--- to batch process the address)
--- Uncomment the SELECT query below if you're doing this one piece at a time:
--- SELECT * FROM address_table;

--- import data from a CSV - the columns in the CSV should be in the list above
COPY address_table from 'C:\gisdata\TableOfAddresses.csv' WITH DELIMITER ',' NULL 'NA' CSV HEADER;

--- Look at the data you added:
--- SELECT * FROM address_table;

--- Add the columns we'll need for geocoding
ALTER TABLE address_table 
ADD lon numeric,  
ADD lat numeric,  
ADD geomout geometry, -- a point geometry in NAD 83 long lat. 
ADD new_address varchar(255),  
ADD rating integer;

--- Look at your new empty columns:
--- SELECT * FROM address_table;

--- This function loops through the table, one row at a time:
CREATE OR REPLACE FUNCTION mygeocoder()
RETURNS void
AS $$
BEGIN 
   FOR i IN 1..(SELECT(count(per_id_fk)) from address_table) LOOP 
      UPDATE address_table 
         SET  (rating, new_address, lon, lat, geomout) = (COALESCE(g.rating,-1),pprint_addy(g.addy), ST_X(g.geomout)::numeric(8,5), ST_Y(g.geomout)::numeric(8,5), (g.geomout) ) 
         FROM (SELECT per_id_fk, input_address FROM address_table WHERE rating IS NULL ORDER BY per_id_fk LIMIT 1) 
         As a 
      LEFT JOIN LATERAL geocode(a.input_address,1) As g ON true 
      WHERE a.per_id_fk = address_table.per_id_fk; 
   END LOOP; 
   RETURN;
END;
$$ LANGUAGE plpgsql;

--- Run your geocoder to geocode your address table:
SELECT mygeocoder();

--- Look at your table with geocoded addresses:
SELECT * FROM address_table;

As you can see in the code comments, this function geocodes the table one row at a time.  According to posts I’ve read, you don’t want to feed your whole table into the geocoder at once.  It makes the computer hold a lot of data in memory and will slow things down.  You can do a few lines at a time, but it makes the code for the loop easier if I didn’t have to figure out how many times to loop (… how many rows do I have… divide by 3… how to know when to stop…).

This is going to take a long time if you have many addresses.  For testing purposes, I highly recommend making a subset of your data – maybe 100 or so addresses – to work on while you adapt your code.  You don’t want to wait hours just to find out something was a little off.

In formulating this post and code, I found these resources to be very helpful and have adapted most of my code from them:

 

Advertisements

About micheletobias

I lead two lives - one as an artist and the other as a scientist. More and more I'm finding my two worlds colliding, and it's not the disaster you might expect. View all posts by micheletobias

5 responses to “Batch Geocoding with PostGIS

  • MB4545

    I recently implemented the Postgres DB based on your blog post “How to make a PostGIS TIGER Geocoder in Less than 5 Days”. I’m comparing geocoded lat/longs for a single address, 14721 CENTER AVE HARVEY IL 60426, across the following 3 sources:
    #1. PostgresDB installed on your instructions
    #2. Google Maps
    #3. Census.gov geocoder: https://geocoding.geo.census.gov/geocoder/geographies/address?form

    Based on the 3 sources above I received the following coordinates:
    #1. PostgresDB: 41.60758,-87.65199
    #2. Google Maps: 41.6217219,-87.6462478
    #3. Census.gov geocoder: 41.622158,-87.64625

    I use a vincenty distance calculation as a method to validate the coordinates.
    1 vs 2: 1.020 mi
    1 vs 3: 1.049 mi
    2 vs 3: 0.030 mi

    Based on the above, it seems the Google Maps & Census.gov geocoders are very closely aligned. The coordinates from the PostgresDB are a mile away. Do you have any recommendations on how to tweak the Postgres DB?

    • micheletobias

      The differences are due to differing input data and estimation algorithms. My suspicion is that Google may have access to parcels, which would get a closer match. I’m not a geocoding expert, just a person who had to implement a tool, found it difficult, and then wrote up some more clear instructions when I got it working. If you can use the other options, by all means, do that. The PostGIS option is good for when you can’t share your addresses with a 3rd party, such as I frequently have in my work with the UCD medical campus.

  • Steve LeClair

    Hi Michele, the code above worked great. I’ve done little with PostgreSQL over the years so it was nice to have a solid start!

    Looking at our addresses, I see a lot of opportunities for improvement in data quality. One future project is to embed a canonical source of addresses into our medical record which (over time) would ensure decent data quality.

    Until then, we need to do what we can with various fuzzy search/matching algorithms. Do you have any experience/recommendations in this area?

    Thanks again!
    …Steve

    • micheletobias

      I’m just learning about fuzzy matching tools. You could certainly tweak the geocoder function I wrote to match with a fuzzy matching tool. Look at the fuzzymatch and pg_trgm modules for postgres. The levenshtein() function is pretty straight forward – number of differences between strings. The similarity() function gives you the number of 3 letter matches between words and seems a little better for knowing that things like ‘400 Main St.’ is much more different than ‘900 Main St.’ as compared with ‘401 Main St.’

  • Steve LeClair

    Hi Michele,
    Thanks for the fuzzy hints.

    BTW, in order to get through my 68k of addresses I needed to add a simple exception handler to the function you gave above (thanks again):

    CREATE OR REPLACE FUNCTION mygeocoder()
    RETURNS void
    AS $$
    BEGIN
    FOR i IN 1..(SELECT(count(per_id_fk)) from address_table) LOOP
    BEGIN
    UPDATE address_table
    SET (rating, new_address, lon, lat, geomout) = (COALESCE(g.rating,-1),pprint_addy(g.addy), ST_X(g.geomout)::numeric(8,5), ST_Y(g.geomout)::numeric(8,5), (g.geomout) )
    FROM (SELECT per_id_fk, input_address FROM address_table WHERE rating IS NULL ORDER BY per_id_fk LIMIT 1)
    As a
    LEFT JOIN LATERAL geocode(a.input_address,1) As g ON true
    WHERE a.per_id_fk = address_table.per_id_fk;
    –Added Exception handler
    EXCEPTION
    WHEN SQLSTATE ‘2201B’ THEN
    UPDATE address_table
    SET rating = -1
    FROM (SELECT per_id_fk, input_address FROM address_table WHERE rating IS NULL ORDER BY per_id_fk LIMIT 1)
    As a
    WHERE a.per_id_fk = address_table.per_id_fk;
    END;
    END LOOP;
    RETURN;
    END;
    $$ LANGUAGE plpgsql;

    The exceptions turned out to be PO box addresses, null addresses and addresses outside of the state that I loaded. The SQLState 2201B was the only error I encountered (“invalid regular expression”…).

    No doubt there are more elegant PostgreSQL solutions but as a noob with this database my simple addition allowed the batch to complete.

I'd love to hear what you think. Leave a comment!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: