Loading Open Addresses Data into PostGIS

I was working on a geocoding project using my TIGER Geocoder.  The results were terrible for addresses related to university campuses, which is not unreasonable given that buildings can be in the middle of big open spaces with no access to actual roads and many universities (like my alma mater & employer, UC Davis) have made-up addresses that don’t correspond with any real-life location.  What can you do?  If you can’t send your data out to an online service for privacy reasons or you need to process a lot of data on a limited budget, you can work with Open Addresses data.  This post will cover the first step in that process, which is loading data from Open Addresses into a PostGIS table.

Open Addresses is an open dataset that provides point locations for large number of addresses around the world.  The data is downloadable in a folder structure that makes finding data as a human very easy… not so much for my computer though.  If you just need one table, you don’t need to worry.  Just import your table like you would any other.  If you need more than a couple, you’re going to want to automate the import.  This is one way to do that.  I can imagine there are more efficient ways, but this is how I did it.

If you look at the Open Address data, you’ll notice that some tables have city and state names and others do not.  This makes just appending all the tables together not helpful.  If in the future they fill in the tables completely, all this code will not be necessary.  Let’s hope!

Download Data

First, you’ll need to download the data.  I downloaded the five US files (don’t forget that one in the Share-Alike column!) from the Open Addresses data download page.  I don’t recommend doing all of the US, speaking from experience.  It’s a huge dataset so unless you need it all, pick just the states or regions you need.  Whatever you download, put the zips all into one folder, and unzipped them into one folder.

My folder structure looks like this:

  • C:
    • GIS_Data
      • OpenAddresses
        • openaddr-collected-us_midwest
          • summary
          • us
        • openaddr-collected-us_northeast
          • summary
          • us
        • openaddr-collected-us_south
          • summary
          • us
        • openaddr-collected-us_west
          • summary
          • us
        • openaddr-collected-us_west-sa
          • summary
          • us

Inside the Summary and US folders are more subfolders with states and then .csv files inside those.  We don’t need to move anything lower than the region folders.  The code will find all the files we need as long as the region folders are in the same folder.

Importing the Data into PostGIS

I’m assuming you’ve already set up a PostgreSQL/PostGIS database.

I wrote my code for the table import process in R (connecting to my database with the RPostgreSQL package) because it’s a language I am very familiar with.  You could write this in as a .bat file or in python too.  Maybe my code comments would be a decent road map for doing that if you’re happier programming in those languages.

I wrote this code specifically for the US data.  If you’re working with other countries, you may need to adjust some of the code, particularly where it parses out the city and state names from the file paths.

# Load libraries
library(RPostgreSQL)

# Function to capitalize words
# --- We'll need this when we parse the city names from the file names
.simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
    sep = "", collapse = " ")
}

# Connect to the database
# --- creates a connection to the postgres database
# --- note that "con" will be used later in each connection to the database
# --- adjust the parameters like dbname, port, user, and password for your specific database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "DATABASE NAME",
host = "localhost", port = 5432,
user = "USER NAME", password = 'YOUR PASSWORD')

# Directory where you stored your Open Address downloads
dr<-"D:\\GIS_Data\\OpenAddresses"

# Look through your directory for the .csv files and put them into a list
csvs<-list.files(dr, recursive=TRUE, pattern = "csv", full.names = TRUE)

# Remove files from the list that have "summary" in the file name
csvs<-csvs[-grep("summary", csvs)]

#Delete any existing tables with the same name from your database
# --- this is in case you end up running the code a couple of times
# --- while you make changes
dbGetQuery(con, 'DROP TABLE IF EXISTS oa;')

#Create the blank table to put the data from the .csv files into
# --- The columns need to match what is in your .csv files 
dbGetQuery(con, 'create table oa (
    lon numeric,
    lat numeric,
    house_number text,
    street text,
    unit text,
    city text,
    district text,
    region text,
    postcode text,
    id text,
    hash text);')

# Copy data into database 
# --- this walks through the list of tables, importing each
# --- and parsing out the city & state name from the file name
for (i in 1:length(csvs)){
    querytext<-paste("COPY oa FROM '", csvs[i], "' DELIMITER ',' CSV HEADER;", sep = "")
    dbGetQuery(con, querytext)

    #what state is it?
    state<-toupper(strsplit(csvs[i], "/")[[1]][4])
    stateupdate<-paste("UPDATE oa SET region = '", state, "' WHERE region is NULL;", sep="")
    dbGetQuery(con, stateupdate)

    #what city is it?
    city<-gsub(".csv", "", strsplit(csvs[i], "/")[[1]][5])
    city<-gsub("city_of_", "", city)
    city<-gsub("_", " ", city)
    city<-.simpleCap(city)
    cityupdate<-paste("UPDATE oa SET city = '", city, "' WHERE city is NULL;", sep="")
    dbGetQuery(con, cityupdate)
}

# Remove "statewide" from the state column 
dbGetQuery(con, "UPDATE oa SET city = NULL WHERE city like 'Statewide';")

The results should be one table in your database called oa that contains all of the Open Address points in the files you downloaded. Now you can use that table to match your list of addresses. I ended up comparing the addresses I needed to geocode with the Open Addresses data using the QGram metric that tells you how similar two strings are. Hopefully soon I’ll have a post with the code I used to find the most similar Open Addresses match to the list of addresses I needed to geocode.

Advertisements

About micheletobias

I lead two lives - one as an artist and the other as a scientist. More and more I'm finding my two worlds colliding, and it's not the disaster you might expect. View all posts by micheletobias

One response to “Loading Open Addresses Data into PostGIS

  • Jeffrey Bradley

    Great posts! I was able to set up a tiger geocode following your other tutorials and then I decided to load openaddresses for my area. I am now using full text search along with Paul Ramsey addressing dictionary and pg_trgm, with my tiger geocoder so I am able to blast through almost all my addresses.

I'd love to hear what you think. Leave a comment!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: