
Loading Open Addresses Data into PostGIS

I was working on a geocoding project using my TIGER geocoder.  The results were terrible for addresses around university campuses, which is not unreasonable: buildings can sit in the middle of big open spaces with no access to actual roads, and many universities (like my alma mater & employer, UC Davis) have made-up addresses that don’t correspond to any real-life location.  What can you do?  If you can’t send your data out to an online service for privacy reasons, or you need to process a lot of data on a limited budget, you can work with Open Addresses data.  This post covers the first step in that process: loading data from Open Addresses into a PostGIS table.

Open Addresses is an open dataset that provides point locations for a large number of addresses around the world.  The data is downloadable in a folder structure that makes finding data very easy for a human… not so much for my computer, though.  If you just need one table, you don’t need to worry; import it like you would any other.  If you need more than a couple, you’re going to want to automate the import.  This is one way to do that.  I imagine there are more efficient ways, but this is how I did it.

If you look at the Open Addresses data, you’ll notice that some tables have city and state names filled in and others do not.  That means simply appending all the tables together isn’t helpful.  If in the future they fill in the tables completely, none of this code will be necessary.  Let’s hope!

Download Data

First, you’ll need to download the data.  I downloaded the five US files (don’t forget the one in the Share-Alike column!) from the Open Addresses data download page.  Speaking from experience, I don’t recommend downloading all of the US.  It’s a huge dataset, so unless you need it all, pick just the states or regions you need.  Whatever you download, put the zips all into one folder and unzip them there.

My folder structure looks like this:

  • C:
    • GIS_Data
      • OpenAddresses
        • openaddr-collected-us_midwest
          • summary
          • us
        • openaddr-collected-us_northeast
          • summary
          • us
        • openaddr-collected-us_south
          • summary
          • us
        • openaddr-collected-us_west
          • summary
          • us
        • openaddr-collected-us_west-sa
          • summary
          • us

Inside the summary and us folders are more subfolders for states, with .csv files inside those.  We don’t need to move anything below the region folders.  The code will find all the files we need as long as the region folders are in the same parent folder.

Importing the Data into PostGIS

I’m assuming you’ve already set up a PostgreSQL/PostGIS database.
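
If you haven’t, here’s a minimal setup sketch (the database name is just an example, and the CREATE EXTENSION statement runs while connected to that database):

--- Create a database and enable PostGIS inside it
CREATE DATABASE openaddresses;
CREATE EXTENSION postgis;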

I wrote my code for the table import process in R (connecting to my database with the RPostgreSQL package) because it’s a language I’m very familiar with.  You could write this as a .bat file or in Python too.  Maybe my code comments will be a decent road map if you’re happier programming in those languages.

I wrote this code specifically for the US data.  If you’re working with other countries, you may need to adjust some of the code, particularly where it parses the city and state names out of the file paths.

# Load libraries
library(RPostgreSQL)

# Function to capitalize words
# --- We'll need this when we parse the city names from the file names
.simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}

# Connect to the database
# --- creates a connection to the postgres database
# --- note that "con" will be used later in each connection to the database
# --- adjust the parameters like dbname, port, user, and password for your specific database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "DATABASE NAME",
                 host = "localhost", port = 5432,
                 user = "USER NAME", password = "YOUR PASSWORD")

# Directory where you stored your Open Addresses downloads
dr <- "C:\\GIS_Data\\OpenAddresses"

# Look through the directory for .csv files and put them into a list
# --- the anchored pattern avoids matching files that merely contain "csv" in the name
csvs <- list.files(dr, recursive = TRUE, pattern = "\\.csv$", full.names = TRUE)

# Remove files from the list that have "summary" in the file name
# --- grepl is safe even if nothing matches (negative indexing with grep would empty the list)
csvs <- csvs[!grepl("summary", csvs)]

# Delete any existing table with the same name from your database
# --- this is in case you end up running the code a couple of times
# --- while you make changes
dbGetQuery(con, 'DROP TABLE IF EXISTS oa;')

# Create the blank table to put the data from the .csv files into
# --- the columns need to match what is in your .csv files
dbGetQuery(con, 'create table oa (
    lon numeric,
    lat numeric,
    house_number text,
    street text,
    unit text,
    city text,
    district text,
    region text,
    postcode text,
    id text,
    hash text);')

# Copy data into database
# --- this walks through the list of .csv files, importing each one
# --- and parsing the city & state name out of the file path
# --- the UPDATEs only touch rows where region/city are NULL, which on each
# --- pass through the loop are exactly the rows that were just imported
for (i in 1:length(csvs)){
    querytext <- paste("COPY oa FROM '", csvs[i], "' DELIMITER ',' CSV HEADER;", sep = "")
    dbGetQuery(con, querytext)

    # what state is it? --- the 4th piece of the file path is the state abbreviation
    state <- toupper(strsplit(csvs[i], "/")[[1]][4])
    stateupdate <- paste("UPDATE oa SET region = '", state, "' WHERE region IS NULL;", sep = "")
    dbGetQuery(con, stateupdate)

    # what city is it? --- the 5th piece of the file path is the city file name
    city <- gsub(".csv", "", strsplit(csvs[i], "/")[[1]][5], fixed = TRUE)
    city <- gsub("city_of_", "", city)
    city <- gsub("_", " ", city)
    city <- .simpleCap(city)
    cityupdate <- paste("UPDATE oa SET city = '", city, "' WHERE city IS NULL;", sep = "")
    dbGetQuery(con, cityupdate)
}

# The statewide files have no real city; clear the "Statewide" placeholder
# --- the file name parsing above turns statewide.csv into a city of "Statewide"
dbGetQuery(con, "UPDATE oa SET city = NULL WHERE city LIKE 'Statewide';")
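
One optional follow-up, not part of the original script: the load leaves lon and lat as plain numeric columns.  If you want actual PostGIS geometry to work with, here’s a minimal sketch, assuming the coordinates are WGS 84 longitude/latitude (which is what Open Addresses publishes):

--- Build a point geometry column from lon/lat and index it
ALTER TABLE oa ADD COLUMN geom geometry(Point, 4326);
UPDATE oa SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
CREATE INDEX oa_geom_idx ON oa USING GIST (geom);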

The result should be one table in your database called oa that contains all of the Open Addresses points in the files you downloaded. Now you can use that table to match your list of addresses. I ended up comparing the addresses I needed to geocode with the Open Addresses data using the QGram metric, which tells you how similar two strings are. Hopefully soon I’ll have a post with the code I used to find the most similar Open Addresses match for each address on my list.
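
I did that string comparison in R, but if you’d rather stay inside Postgres, the pg_trgm extension gives you a trigram similarity measure (a q-gram measure with q = 3) that can stand in for QGram.  A rough sketch, where my_addresses and its input_address column are made-up names for your own address list:

--- For each input address, find the single most similar Open Addresses row
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT m.input_address, best.*
FROM my_addresses AS m
CROSS JOIN LATERAL (
    SELECT oa.house_number, oa.street, oa.city, oa.region,
           similarity(m.input_address,
                      concat_ws(' ', oa.house_number, oa.street, oa.city, oa.region)) AS sim
    FROM oa
    ORDER BY concat_ws(' ', oa.house_number, oa.street, oa.city, oa.region) <-> m.input_address
    LIMIT 1
) AS best;

On a big oa table you’d want to precompute that concatenated address into its own column and put a GiST gist_trgm_ops index on it; otherwise this will crawl.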


Batch Geocoding with PostGIS

So you’ve got a table of addresses that need to be matched up with their latitude/longitude coordinates… i.e. you need to geocode some addresses.  If all you need is a handful of addresses geolocated, an online tool is usually a better choice than running your own geocoder.  But if you can’t share your addresses, or you have too many to send through an online tool, batch geocoding on your own computer is a good option.  In this post I’ll share how I’ve been doing it.

First, you’ll need a geocoder.  If you’ve got a PostGIS geocoder up and running, congratulations!  That’s no easy feat.  If you don’t, take a detour over to my post about how to make a PostGIS geocoder in less than 5 days.

Once you’ve made your geocoder, you’re probably fairly familiar with how to run queries in pgAdmin 4, but just to be on the safe side, I’m gonna give you all the details I think you’ll need.

Start pgAdmin 4.  (If you get an error, you may need to start the PostgreSQL service – see the link above for details.)  Navigate in the browser panel on the left side of the window to your geocoder database and click it to highlight it.  Then open the Query Tool: Tools menu at the top, then select Query Tool.

This is the code I’ve been running (with copious code comments to help you… and me later… understand what I did).  You’ll need to make adjustments for your specific table of addresses, but after that, you can paste the whole thing into the Query Tool and hit the run button (why is it a lightning bolt icon?).

--- remove any existing attempts
DROP TABLE IF EXISTS address_table;

--- Create an empty table
--- Make the columns match those of the CSV that has the data 
--- you want to geocode

CREATE TABLE address_table(
per_id_fk varchar(255), 
acity varchar(255), 
astate varchar(255),
azip varchar(255),
acounty varchar(255),
adate varchar(255),
street varchar(255),
input_address varchar(255));

--- Look at your new empty table:   
--- (I like to have visual confirmation that things worked the first time 
--- I run them... you don't need to do this when you run the whole script 
--- to batch process the address)
--- Uncomment the SELECT query below if you're doing this one piece at a time:
--- SELECT * FROM address_table;

--- import data from a CSV - the columns in the CSV should be in the list above
COPY address_table from 'C:\gisdata\TableOfAddresses.csv' WITH DELIMITER ',' NULL 'NA' CSV HEADER;
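--- Note: COPY runs on the database server, so the server process needs
--- permission to read that file path; if that's a problem, psql's \copy
--- does the same import from the client side instead.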

--- Look at the data you added:
--- SELECT * FROM address_table;

--- Add the columns we'll need for geocoding
ALTER TABLE address_table 
ADD lon numeric,  
ADD lat numeric,  
ADD geomout geometry, -- a point geometry in NAD 83 long lat. 
ADD new_address varchar(255),  
ADD rating integer;

--- Look at your new empty columns:
--- SELECT * FROM address_table;

--- This function loops through the table, geocoding one row per pass:
CREATE OR REPLACE FUNCTION mygeocoder()
RETURNS void
AS $$
BEGIN
   -- run the loop once per row in the table
   FOR i IN 1..(SELECT count(per_id_fk) FROM address_table) LOOP
      UPDATE address_table
         -- a rating of -1 marks addresses the geocoder couldn't match,
         -- so they won't be picked up again on a later pass
         SET (rating, new_address, lon, lat, geomout) =
             (COALESCE(g.rating, -1), pprint_addy(g.addy),
              ST_X(g.geomout)::numeric(8,5), ST_Y(g.geomout)::numeric(8,5), g.geomout)
         -- grab one row that hasn't been geocoded yet
         FROM (SELECT per_id_fk, input_address
               FROM address_table
               WHERE rating IS NULL
               ORDER BY per_id_fk
               LIMIT 1) AS a
         -- geocode it, keeping only the single best candidate
         LEFT JOIN LATERAL geocode(a.input_address, 1) AS g ON true
         WHERE a.per_id_fk = address_table.per_id_fk;
   END LOOP;
   RETURN;
END;
$$ LANGUAGE plpgsql;

--- Run your geocoder to geocode your address table:
SELECT mygeocoder();

--- Look at your table with geocoded addresses:
SELECT * FROM address_table;

As you can see in the code comments, this function geocodes the table one row at a time.  According to posts I’ve read, you don’t want to feed your whole table into the geocoder at once: it makes the computer hold a lot of data in memory and slows things down.  You could do a few rows at a time, but the loop code is easier if I don’t have to figure out how many times to loop (… how many rows do I have… divide by 3… how do I know when to stop…).
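
If the up-front row count bugs you too, here’s an untested sketch of the same function that just loops until no ungeocoded rows remain:

--- Same logic as mygeocoder(), but the loop exits on its own
CREATE OR REPLACE FUNCTION mygeocoder_until_done()
RETURNS void
AS $$
BEGIN
   WHILE EXISTS (SELECT 1 FROM address_table WHERE rating IS NULL) LOOP
      UPDATE address_table
         SET (rating, new_address, lon, lat, geomout) =
             (COALESCE(g.rating, -1), pprint_addy(g.addy),
              ST_X(g.geomout)::numeric(8,5), ST_Y(g.geomout)::numeric(8,5), g.geomout)
         FROM (SELECT per_id_fk, input_address
               FROM address_table
               WHERE rating IS NULL
               ORDER BY per_id_fk
               LIMIT 1) AS a
         LEFT JOIN LATERAL geocode(a.input_address, 1) AS g ON true
         WHERE a.per_id_fk = address_table.per_id_fk;
   END LOOP;
   RETURN;
END;
$$ LANGUAGE plpgsql;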

This is going to take a long time if you have many addresses.  For testing purposes, I highly recommend making a subset of your data – maybe 100 or so addresses – to work on while you adapt your code.  You don’t want to wait hours just to find out something was a little off.
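
Making that subset is easy right in the database; a quick sketch (address_table_test is a made-up name):

--- Copy ~100 rows into a practice table
DROP TABLE IF EXISTS address_table_test;
CREATE TABLE address_table_test AS
SELECT * FROM address_table LIMIT 100;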

In formulating this post and code, I found these resources to be very helpful and have adapted most of my code from them: