From a couple of months, I have been involved in Tax certificate address verification project for educational purpose; I have around 8 million parcel address collected and maintained by different tax collecting agencies in various format for instance COBOL ,CSV, ASCII, Excel , and some are even in pdf for their tax districts.
My task was to find the address from those different files and export them in CSV with corrected standard USPS address format (STREET, CITY, STATE, and ZIP) to make Universal Tax File format for data analysis by removing unwanted data from the address rows. Initially, I used ZP4 software, it has “official United States Postal Service data files on a single DVD-ROM that provides a powerful tool for automatically determining the correct mailing address, ZIP + 4 code, and mail carrier route number for any location in the United States”, works fine for few address (about 15%) which has proper USPS standard address format. The real challenge starts with remaining 85% of the parcel address that has no USPS standard address format (either address without city and zip information or the vacant parcel lots that cannot accept mails and therefore not entered in the USPS mailing address database).
Looking around the Web, Google leaves me with tons of pages entitled on address correction and validation including close source and open source software APIs, tools, and datasets, but hard to figure out which method works for my project in terms of address quality(address was poorly written), quantity(8 million addresses), and resources (cheap and easy to implement) . After spending few days here and there, finally I come up with a solution that works for me, an integrated address correction strategy using ZP4, Google Geocoding API, and Bing Maps API.
Here, I am going to share how ZP4, Google and Bing integration makes my job easy and relatively cheap to validate and correct US physical street addresses.
ZP4: Address and postal data on DVD-ROM
Input format- House #, Street, City, State, and Zip (Red: Mandatory fields, Blue: At least two mandatory fields out of three)
Response format- House #, Street, City, State, Zip, Zip4, Lat/Lon, CITY STATE ZIP, and various others fields
Type- DVD- ROM with address correction ZP4.dll
Cost-$99
License- Commercial
Address source- USPS address
Provider- Semaphore Corporation
Interface: Window interface and the DVD comes with demo program showing how to send and receive address records from C, C#, Delphi, VB, and so forth.
Limitations:
a. Works only address in USPS postal databases
b. Must provide at least two fields out of City, State ,and Zip
c. Limited addresses look up for correction and verification
Google - The Google Geocoding API
Input format- House #, Street, City, State, and Zip (Red: Mandatory fields, Blue: Not mandatory, recommends to provide for confidence on parsed address)
Response format- JSON, XML /Corrected fields-House #, Street, City, State, Zip
Type- REST API Protocol
Cost- Trial Free (2500 Geocode API request/ IP address within 24 hrs), Commercial- starting with $10000 (100,000 requests per day)
License- Trial, Commercial
Address source- Multiple sources
Provider- Google
Interface: No Window interface, but good documentation on using API for addresses geocoding.
Bing- Locations API
Input format- House #, Street, City, State, and Zip (Red: Mandatory fields, Blue: Not mandatory, recommends to provide for confidence on parsed address)
Response format- JSON, XML/ Corrected fields-House #, Street, City, State, Zip
Type- REST API Protocol
Cost- Trial Free (50,000 Geocode API request/ Bing Maps API key with in 24hrs) for 90 days. Can generate up to 3 trail keys from a Microsoft live account that allows you to verify 150000 address/24hrs for 90 days.
License- Trial, Commercial
Address source- Multiple sources
Provider- Microsoft
Interface: No Window interface, but good documentation on using API for addresses geocoding.
Both Google and Bing APIs has limitations in terms of API key limit and can't correct address with 100% accuracy. However, helps a lot for address correction.
Another possible Geocoding API : Yahoo! Boss Geo PlaceFinder
Input format- House #, Street, City, State, and Zip (Red: Mandatory fields, Blue: Not mandatory, recommends to provide for confidence on parsed address)
Response format- JSON, XML/ Corrected fields-House #, Street, City, State, Zip
Type- REST API Protocol
Geocoding API Examples: http://developer.yahoo.com/boss/geo/docs/examples-placefinder.html
Among three address correction tools/APIs Bing works for me well than ZP4 and GOOGLE in terms of finding missing city and zip. ZP4 works well for the STANDARD USPS address datasets and prevents from calling Geocoding API even for the standard addresses; the API key is the most important and limited resource for my educational trail project.
Integrated Address Correction Code Snippets in VB
'Parse Address
Function TestRegExp(myPattern As String, myString As String) As Boolean
'Create objects.
Dim objRegExp As Regex
' Create a regular expression object.
objRegExp = New Regex(myPattern)
'Test whether the String can be compared.
Dim match As Match = objRegExp.Match(myString)
If match.Success Then
Return True
Else
Return False
End If
End Function
Private Function getLastIndexOf(ByVal sArray As String(), ByVal checkArray As String()) As Integer
Dim sourceIndex As Integer = 0
Dim outputIndex As Integer = 0
For Each item As String In checkArray
For Each source As String In sArray
If source.ToLower = item.ToLower Then
outputIndex = sourceIndex
If item.ToLower = "box" Then
outputIndex = outputIndex + 1
End If
End If
sourceIndex = sourceIndex + 1
Next
sourceIndex = 0
Next
Return outputIndex
End Function
Public Function parseAddress(ByVal input As String) As Collection
Dim output As New Collection
If (Regex.IsMatch(input, "^[0-9 ]+$")) Then
Console.WriteLine("Only numbers in the Address field")
output.Add("", "Address1")
output.Add("", "Address2")
output.Add("", "City")
output.Add("", "State")
output.Add("", "Zip")
Return output
Else
input = input.Replace(",", "")
input = input.Replace(" ", " ")
input = input.Replace(" CR ", " County Road ")
input = input.Replace(" SR ", " State Road ")
Dim splitString() As String = Split(input)
Dim streetMarker() As String = New String() {"street", "County Road", "Sate Road", "st", "st.", "avenue", "ave", "ave e", "ave.", "blvd", "blvd.", "highway", "hwy", "hwy.", "box", "road", "plz", "madera", "walk", "bend", "rd", "rd.", "lane", "ln", "ln.", "circle", "circ", "cir", "circ.", "court", "ct", "ter", "ct.", "dr", "trl", "pl", "sr", "cr", "gln", "loop", "way", "xing", "us", "cv", "run", "bluff", "pky", "trce", "palm", "trace", "business", "fairway", "broadway", "island", "manor", "overlook", "oak", "majestic"}
Dim address1 As String
Dim address2 As String = ""
Dim city As String
Dim state As String
Dim zip As String
Dim streetMarkerIndex As Integer
Dim flag As Boolean
zip = splitString(splitString.Length - 1).ToString()
flag = TestRegExp("^\d{5}(?:[-\s]\d{4})?$", zip)
If flag = True Then
state = splitString(splitString.Length - 2).ToString()
streetMarkerIndex = getLastIndexOf(splitString, streetMarker) + 1
Dim sb As New StringBuilder
For counter As Integer = streetMarkerIndex To splitString.Length - 3
sb.Append(splitString(counter) + " ")
Next counter
city = RTrim(sb.ToString())
Dim addressIndex As Integer = 0
For counter As Integer = 0 To streetMarkerIndex
If IsNumeric(splitString(counter)) _
Or splitString(counter).ToString.ToLower = "po" _
Or splitString(counter).ToString().ToLower().Replace(".", "") = "po" Then
addressIndex = counter
Exit For
End If
Next counter
sb = New StringBuilder
For counter As Integer = addressIndex To streetMarkerIndex - 1
sb.Append(splitString(counter) + " ")
Next counter
address1 = RTrim(sb.ToString())
output.Add(address1, "Address1")
output.Add(city, "City")
output.Add(state, "State")
output.Add(zip, "Zip")
Return output
Else
output.Add(input, "Address1")
output.Add("", "Address2")
output.Add("", "City")
output.Add("", "State")
output.Add("", "Zip")
Return output
End If
End If
End Function
'Correct address using ZP4 software
Public Function parseZP4(ByVal address As String, ByVal city As String, ByVal state As String) As StringBuilder
Dim addstring As String
Dim session As Integer = 0
ZP4.ZP4StartSession(session)
If (session = 0) Then
Console.WriteLine("Session allocation failed!")
End If
'ZP4.ZP4TimeLimit(session, 0)
If (ZP4.ZP4InputOrder(session, "Address" & vbTab & "City" & vbTab & "State") = 0) Then
Console.WriteLine("Invalid input list!")
End If
If (ZP4.ZP4OutputOrder(session, "Address (final)" & vbTab & "City (final)" & vbTab & "State (final)" & vbTab & "ZIP (final)" & vbTab & "ZIP (five-digit)" & vbTab & "ZIP (four-digit add-on)" & vbTab & "Error message") = 0) Then
Console.WriteLine("Invalid output list!")
End If
Dim sb As StringBuilder = New StringBuilder(5000)
' mutable string buffer for result
addstring = address.Trim & vbTab & city.Trim & vbTab & state.Trim
If (ZP4.ZP4Correct(session, addstring.Trim, sb) = 0) Then
Console.WriteLine("Correction call failed!")
End If
'Dim s() As String = sb.ToString.Split(vbTab)
Return sb
End Function
'Correct address using Google Geocoding API
Public Function GetGoogleAddress(ByVal address1 As String, ByVal city As String, ByVal state As String) As Hashtable
Dim addressxml As String = String.Empty
Dim zipCode As String = String.Empty
Dim m_hashTable As New Hashtable()
m_hashTable.Clear()
Try
address1 = address1.Replace(" CR ", " County Road ")
address1 = address1.Replace(" SR ", " State Road ")
'Making 0.5 second delay on API call
System.Threading.Thread.Sleep(500)
'Create an object of web client
Dim wsClient As New WebClient()
'Construct the URL concating the address values with it
Dim zipcodeurl As String = "?address={0},+{1},+{2}&sensor=false"
'Here in constructing the URL sensor(is required) indicates whether or not the geocoding request comes from a device with a location sensor.
Dim url As String = "http://maps.googleapis.com/maps/api/geocode/xml" & zipcodeurl
url = [String].Format(url, address1.Replace(" ", "+"), city.Replace(" ", "+"), state.Replace(" ", "+"))
'Download the data in XML format as string by making a web request
addressxml = wsClient.DownloadString(url)
'Check if status is OK then proceed
If addressxml.Contains("OK") Then
'Check if postal_code section is there in the string then proceed
If addressxml.Contains("postal_code") Then
Dim xmlDoc As New XmlDocument()
xmlDoc.LoadXml(addressxml)
Dim m_nodelist As XmlNodeList
Dim node As XmlNode
'Get the list of all address_companent nodes as this component only contans the address information
'm_nodelist = node.SelectNodes("/GeocodeResponse/result/address_component")
node = xmlDoc.SelectSingleNode("/GeocodeResponse/result")
m_nodelist = node.SelectNodes("address_component")
'From each component check for the type section for getting the particular postal_code section
Dim gcount As Integer = 0 'Just to read first entry of google map
For Each m_node In m_nodelist
'Get the zipLongName Element Value
Dim LongName = m_node.ChildNodes.Item(0).InnerText
'Get the zipShortName Element Value
Dim ShortName = m_node.ChildNodes.Item(1).InnerText
'Get the zipType Element Value
Dim Type = m_node.ChildNodes.Item(2).InnerText
'If the type of the component is postal_code then get the postal code as zipLongName
If Type = "street_number" Then
m_hashTable.Add("street_number", LongName)
End If
If Type = "route" Then
m_hashTable.Add("street", LongName)
End If
If Type = "locality" Then
m_hashTable.Add("city", LongName)
End If
If Type = "administrative_area_level_1" Then
m_hashTable.Add("state", ShortName)
End If
If Type = "postal_code" Then
zipCode = LongName
m_hashTable.Add("zip", LongName)
End If
gcount = gcount + 1
Next
End If
Else
Console.WriteLine("Invalid address or Reached the Google API Quota Limit")
End If
Catch ex As WebException
Console.WriteLine(ex.ToString())
End Try
Return m_hashTable
End Function
'Validate address using Bing Geocoding API
Public Function GetBingAddress(ByVal address1 As String, ByVal city As String, ByVal state As String) As Hashtable
'Reference:http://msdn.microsoft.com/en-us/library/ff701714.aspx
Dim addressxml As String = String.Empty
Dim zipCode As String = String.Empty
Dim m_hashTable As New Hashtable()
m_hashTable.Clear()
Try
address1 = address1.Replace(" CR ", " County Road ")
address1 = address1.Replace(" SR ", " State Road ")
System.Threading.Thread.Sleep(500)
'Create an object of web client
Dim wsClient As New WebClient()
'Construct the URL concating the address values with it
Dim keyurl As String = "&output=xml&key=BINGKEY"
'Here in constructing the URL sensor(is required) indicates whether or not the geocoding request comes from a device with a location sensor.
Dim url As String = " http://dev.virtualearth.net/REST/v1/Locations?q=" & address1.Trim & " " & city.Trim & " " & state.Trim & keyurl.Trim
'Download the data in XML format as string by making a web request
addressxml = wsClient.DownloadString(url)
'Check if status is OK then proceed
If addressxml.Contains("OK") Then
'Check if postal_code section is there in the string then proceed
If addressxml.Contains("PostalCode") Then
Dim xmlDoc As New XmlDocument()
xmlDoc.LoadXml(addressxml)
Dim nsmgr As New XmlNamespaceManager(xmlDoc.NameTable)
nsmgr.AddNamespace("rest", "http://schemas.microsoft.com/search/local/ws/rest/v1")
'Get formatted addresses: Option 1
'Get all locations in the response and then extract the formatted address for each location
Dim locationElements As XmlNodeList = xmlDoc.SelectNodes("//rest:Location", nsmgr)
Dim countAddress As Integer = 0
For Each location As XmlNode In locationElements
If countAddress < 1 Then
Try
Dim formatedAddress As String = location.SelectSingleNode(".//rest:FormattedAddress", nsmgr).InnerText
Dim splitfAdress() As String = formatedAddress.Split(",")
m_hashTable.Add("street", location.SelectSingleNode(".//rest:AddressLine", nsmgr).InnerText)
m_hashTable.Add("city", splitfAdress(1))
m_hashTable.Add("state", location.SelectSingleNode(".//rest:AdminDistrict", nsmgr).InnerText)
m_hashTable.Add("zip", location.SelectSingleNode(".//rest:PostalCode", nsmgr).InnerText)
countAddress = countAddress + 1
Catch ex As Exception
Console.WriteLine(ex.ToString)
End Try
End If
Next
End If
Else
Console.WriteLine("No OK on XML or Over Quota")
End If
Catch ex As WebException
'Console.WriteErrorLog(ex, "GetZipcode", "CompanyDetailViewerForm")
Console.WriteLine(ex.ToString())
End Try
Return m_hashTable
End Function
'Get Corrected address using mutually exclusive sequence of ZP4.dll,Google Geocoding API & Bing API
Public Function GetCorrectedAddress(ByVal address As String) As Hashtable
Dim googleAddress, bingAddress As Hashtable
Dim returnAddress As New Hashtable
Dim pAddress As Collection
Dim zp4address As StringBuilder
Dim zpaddress As String()
Dim pcity, pstate, pzip, pstaddress, zip4 As String
Dim zp4ParseStatus As String = ""
Dim ApiParseStaus As String = ""
Dim tempAddress As String
Dim oldParcelId As String = ""
Dim City, State, ZipCode, StreetAddress, ZipCode4 As String
returnAddress.Clear()
tempAddress = address
pAddress = parseAddress(tempAddress.Trim)
pcity = pAddress.Item("City")
pstate = pAddress.Item("State")
pzip = pAddress.Item("Zip")
pstaddress = pAddress.Item("Address1")
zp4address = parseZP4(pstaddress, pcity, pstate)
zpaddress = zp4address.ToString.Split(vbTab)
City = zpaddress(1)
State = zpaddress(2)
ZipCode = zpaddress(4)
StreetAddress = zpaddress(0)
ZipCode4 = zpaddress(5)
zip4 = zpaddress(3)
zp4ParseStatus = zpaddress(6)
If ((zpaddress(0) = "0") Or (zpaddress(0) = "NO") Or (zpaddress(0) = "")) Then
City = "Invalid"
State = "Invalid"
ZipCode = "Invalid"
StreetAddress = "Invalid"
ZipCode4 = "Invalid"
End If
'If ZP4 couldn't find the address
If ((zip4 = " ") Or (zip4 = "") Or (zpaddress(0) = "0") Or (zpaddress(0) = "NO") Or (zpaddress(0) = " ") Or (zpaddress(0) = "") Or (zpaddress(0) = "no") Or (zpaddress(0) = "No")) Then
'parse using Bing Geocoding API
bingAddress = GetBingAddress(tempAddress, "", "FL")
If (bingAddress.Count < 4) Then
'parse using Google Geocodign API
googleAddress = GetGoogleAddress(tempAddress.Trim, "", "FL")
Try
If (googleAddress.Count < 5) Then
ApiParseStaus = "Invalid Address:Google & Bing "
Else
City = googleAddress("city")
State = googleAddress("state")
ZipCode = googleAddress("zip")
StreetAddress = googleAddress("street_number") & " " & googleAddress("street")
ZipCode4 = ""
ApiParseStaus = "Parse Address:Google "
End If
Catch ex As Exception
Console.WriteLine("Invalid Address:Google ")
End Try
Else
City = bingAddress("city")
State = bingAddress("state")
ZipCode = bingAddress("zip")
StreetAddress = bingAddress("street")
ZipCode4 = ""
ApiParseStaus = "Parse Address:Bing"
End If
End If
returnAddress.Add("city", City)
returnAddress.Add("state", State)
returnAddress.Add("zip", ZipCode)
returnAddress.Add("zip4", ZipCode4)
returnAddress.Add("street", StreetAddress)
returnAddress.Add("googlebingstatus", ApiParseStaus)
returnAddress.Add("zp4status", zp4ParseStatus)
Return (returnAddress)
End Function
References
http://msdn.microsoft.com/en-us/library/ff701713.aspx
https://developers.google.com/maps/documentation/geocoding/
Thanks
to all programmers and bloggers for their nice articles, replies, and code
samples without them this post would not be possible.

"[ZP4] Must provide at least two fields out of City, State ,and Zip" is not true. Only a city or only a ZIP is sufficient as input, as is only a county.
ReplyDeleteAlso, you may want to check out AccuZIP's API and Web Service for real-time address correction.
ReplyDeletehttp://www.accuzip.com/products/data_quality/index.htm
ZP4 is good software but user support is non-existent. Good luck getting them to even answer and email and the user manual is less than helpful. Many other more user friendly options out there.
ReplyDeleteAnd that is founded in 1981. I bet it is started by 1 guy who is still making money selling the CDs of ZP4.
DeleteAgree!
ReplyDeleteOP/anyone: can you help me track down old ZP4 CDs? I'm working on a project where I need to match addresses coded as ZIP+4 to their street address ranges to geocode them, for each year 1999-most recent. Would be great to have 1 CD per year (or per month)!
ReplyDelete