AddressNet: How to build a robust street address parser using a Recurrent Neural Network

Jason Rigby
Towards Data Science
8 min read · Dec 5, 2018


Australian Postage stamps, image by author

Street addresses are complex beasts; they’re designed to be systematic and unambiguous to us human-folk, but end up a disaster for machines. Indeed, you’ll be paying US$5 per 1000 address lookups with Google, and Australia Post will point you to a selection of certified “solutions providers” while also offering discounts to those who pre-sort and barcode their mail (i.e. do the hard work).

Maintaining structured address data has value beyond ye olde postage. The ability to reduce an address to geographic coordinates (geocoding) can yield answers to questions as deceptively simple as “where do I make most of my sales?”, but it doesn’t take much imagination to see how all manner of correlations could be drawn from a person’s location, e.g. in public health (the industry in which I currently work).

Now, of course it makes sense to collect this information in a structured manner from the outset, but what if you have a legacy dataset of manually entered and unvalidated address records? If you ask on StackOverflow the innocent question of…

Looking for a quick and dirty way to parse Australian street addresses into its parts:
3A/45 Jindabyne Rd, Oakleigh, VIC 3166

… you will be quickly told that “you’re going to get data that’s completely useless,” and then you’ll be shown a regular expression like this:

(?P<Unit_Number>\d*)\s*[\/,\,]\s*(?P<Street_Number>\d*)\s*(?P<Street_Name>[a-zA-Z\s]*),?\s*(?P<State>NSW|ACT|NT|QLD|SA|TAS|VIC|WA)\s*(?P<Post_Code>\d{4})| (?P<street_number>\d*)\s*(?P<street_name>[a-zA-Z\s]*),?\s*(?P<state>NSW|ACT|NT|QLD|SA|TAS|VIC|WA)\s*(?P<post_code>\d{4})
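If you’re curious how quickly that pattern falls over, here’s a short sketch using Python’s standard re module; the humble unit suffix in “3A” is enough to sink it.

```python
import re

# The pattern from the answer above, verbatim (note that the duplicated
# group names only differ by case, which Python happens to tolerate)
pattern = re.compile(
    r"(?P<Unit_Number>\d*)\s*[\/,\,]\s*(?P<Street_Number>\d*)\s*"
    r"(?P<Street_Name>[a-zA-Z\s]*),?\s*(?P<State>NSW|ACT|NT|QLD|SA|TAS|VIC|WA)\s*"
    r"(?P<Post_Code>\d{4})|"
    r" (?P<street_number>\d*)\s*(?P<street_name>[a-zA-Z\s]*),?\s*"
    r"(?P<state>NSW|ACT|NT|QLD|SA|TAS|VIC|WA)\s*(?P<post_code>\d{4})"
)

# The unit suffix "A" in "3A" isn't accounted for anywhere in the pattern,
# so matching from the start of the string fails outright.
print(pattern.match("3A/45 Jindabyne Rd, Oakleigh, VIC 3166"))  # None
```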

Dejected and saddened by the complex and unforgiving regular expression, you’ll slink away and look up #ferrets on Instagram, because that’s really what the internet is for.

I’m here to tell you to not be sad. I mean, watch the ferrets; they’re so cute, after all… and I wouldn’t want to rob you of such cuteness. But once you’re done, come back. I’ll show you how you can build your own address parsing machine.

AddressNet, following the conventional neural network nomenclature of [Thing]+Net, is a nifty model that sorts out the bits of an address by labelling them as any one of 22 possible components, and is based on the GNAF database. It’s the product of about a week’s worth of work poking around TensorFlow (and de-rusting myself after a bit of a hiatus from machine learning); I’m certain a more talented soul could produce this in a much shorter time (hash-tag-kids-these-days).

The tl;dr: If you want a Python API to chop up your Australian street addresses, give AddressNet a go.

>>> from addressnet.predict import predict_one
>>> print(predict_one("3A/45 Jindabyne Rd, Oakleigh, VIC 3166"))
{'flat_number': '3', 'flat_number_suffix': 'A', 'number_first': '45', 'street_name': 'JINDABYNE', 'street_type': 'ROAD', 'locality_name': 'OAKLEIGH', 'state': 'VICTORIA', 'postcode': '3166'}

Boom.

So how does this trickery work? I’m sorry to say that I cannot do the theory justice; many smarter than I have written at length on these topics. So I’ll explain the process used in AddressNet at a high-level and link to my favourite authors whose elegant explanations deserve no reproduction here.

At the heart of the AddressNet model is a Recurrent Neural Network (RNN); RNNs are great at modelling sequential data (in this case, a sequence of letters). This kind of neural network is often shown diagrammatically as:

A typical representation of RNNs (left and right are equivalent)

In the above diagram, x is an item from the input data sequence, and y is some target estimation or output. The loopy arrow on the left, and equivalently the horizontal arrows marked ‘h’ on the right, represent a “hidden state.” Each input is subscripted (t-1, t, t+1, etc.) indicating the point from which a particular sequence element originated; this is often referred to as the “time-step” (although it is often not time in the literal sense, but rather a “directional progression” through the elements). The big ol’ circular blobs in the middle contain all the matrix operations that produce the ys and hs, but be sure to note that each blob is identical; the exact same internal parameters are applied to each incoming h and x, which is what makes the network recurrent.

[Recommended reading: The Unreasonable Effectiveness of Recurrent Neural Networks]

The way RNNs operate can be superficially explained like this: for each item of data in the sequence, transform it in some way using the hidden state of the previous step to estimate a target output, while also updating the hidden state to be handed off to the next step. This mysterious hidden state can be thought of as “knowledge gleaned as the sequence is scanned,” and it functions as a memory of sorts. Many variations of RNN architecture centre around how the hidden state is managed; the most common types are LSTM- and GRU-type RNNs (AddressNet uses the GRU type).

[Recommended reading: Understanding LSTM Networks; Understanding GRU networks]
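For the curious, here’s what a single GRU step boils down to, sketched in plain NumPy using one common formulation (biases omitted for brevity). This is purely illustrative and not taken from the AddressNet code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time-step: combine the current input x_t with the previous
    hidden state h_prev to produce the next hidden state. The same params
    are reused at every step -- that's what makes the network recurrent."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate: how much of the candidate to let in
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate: how much old state to forget
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde            # blend previous and candidate states

# Toy example: scan a sequence of 8-dimensional inputs with a 16-unit GRU
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
params = [rng.normal(scale=0.1, size=(hidden_dim, d))
          for d in (input_dim, hidden_dim) * 3]
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 "characters"
    h = gru_step(x_t, h, params)             # hidden state carried forward to the next step
```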

Being able to carry forward information from earlier in the sequence is the key strength of RNNs. In natural language processing, for example, grammatical features such as plurality and gender, and the word conjugations that depend on them, are often linked over long distances. Imagine an RNN designed to identify inappropriate verb forms; a simplistic grammar checker, if you will:

A hypothetical NLP RNN to detect grammar issues. Long distance dependencies highlight the importance of “memory”

Each x represents a word of the sentence, and each y yields a 0 if there is no mistake, and a 1 if there is a mistake. Here, the verb “is” should actually be “are” since the noun “diagrams” is plural. The ability to retain some memory of prior elements allows this hypothetical network to generate outcomes framed as “given what I have seen so far, what is the probability that the current element is a mistake?”

RNNs can be extended to be multi-layer (outputs at each step are fed in as the inputs of another RNN), and they can also be paired with an equivalent RNN where the horizontal arrows are in the reverse direction, passing the hidden state backwards, effectively looking into the “future” to learn about the “past.” The latter is known as a bidirectional RNN.

[Recommended reading: Bidirectional recurrent neural networks / PDF ]

A multi-layer bidirectional RNN is used in AddressNet on a per-character basis. The bidirectionality is particularly important: consider the address “10 24–26 High Street Road, Mount Waverley.” Assigning each character a class when only looking forward in time is almost impossible for the first character! Yes, maybe the unit/flat number is most frequently the first character, and perhaps there are more units than stand-alone properties overall, so we may statistically stand a chance, but it’s really just a stab in the dark. Similarly, without any ability to look into the future, “Street” probably indicates the type of street… but then we encounter “Road,” only to learn that we’ve made a mistake and the name is actually “High Street” of type “Road.” One thing in the AddressNet implementation that differs from the diagrams above is that the output of the RNN is not the predictions directly; instead, the outputs of the forward and backward passes are joined together and put through a final “fully connected” neural network (the standard textbook kind that’s #1 in a Google Image search). If you want to see the exact implementation of this network, check out this part of the code.
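To give a rough picture of what such a network looks like in code, here’s an illustrative Keras sketch of a per-character, stacked bidirectional GRU tagger. It is not the actual AddressNet implementation, and the vocabulary size and layer widths are placeholders; the character embedding layer is explained in the next section.

```python
import tensorflow as tf

vocab_size = 128    # size of the character vocabulary (placeholder assumption)
embedding_dim = 8   # short per-character embedding, as described below
n_classes = 22      # one label per address component

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),            # variable-length sequence of character IDs
    tf.keras.layers.Embedding(vocab_size, embedding_dim),    # learned character embeddings
    tf.keras.layers.Bidirectional(                            # layer 1: forward + backward GRU
        tf.keras.layers.GRU(128, return_sequences=True)),
    tf.keras.layers.Bidirectional(                            # layer 2: stacked on top
        tf.keras.layers.GRU(128, return_sequences=True)),
    tf.keras.layers.Dense(n_classes, activation="softmax"),  # per-character class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

By default, Keras’s Bidirectional wrapper concatenates the forward and backward outputs at each character, which corresponds to the “joining together” described above before the final fully connected layer.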

But how are characters represented in the first place? What is ‘A’? How can we shove ‘A’, a letter, into this magical bundle of mathematics? There are two main ways that textual data can be represented in neural networks: the “one-hot” vector approach, and the dense “embedding vector” approach. The former is a vector as long as our vocabulary (which could be a list of words, characters, or even emoji 🤷‍♀️), containing a ‘1’ at the index corresponding to the item in our dictionary and zeros elsewhere. The embedding vector approach instead uses an arbitrarily long (but usually much shorter than the vocabulary) vector whose parameters are learnt by the model; that is, even the representations themselves are optimised as we train. Historically, embedding vectors have been used to derive semantic word relationships, but here I just use them for dimensionality reduction (and I generally like the idea of optimising the input parameters). AddressNet uses the embedding approach with short eight-unit vectors per character.

[Recommended reading: TensorFlow docs, and anything on word embeddings (e.g. Mikolov’s word2vec paper)]
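To make the distinction concrete, here’s a tiny sketch contrasting the two representations. The vocabulary is made up for illustration and isn’t AddressNet’s actual character set, and the embedding table is just randomly initialised to show the shapes involved.

```python
import numpy as np

# A toy character vocabulary (an assumption for illustration only)
vocab = list("abcdefghijklmnopqrstuvwxyz0123456789/- ")
char_to_idx = {c: i for i, c in enumerate(vocab)}

# One-hot: a vector as long as the vocabulary, all zeros except a single 1
def one_hot(char):
    vec = np.zeros(len(vocab))
    vec[char_to_idx[char]] = 1.0
    return vec

# Embedding: a much shorter dense vector per character, whose values would
# be learned during training (here just randomly initialised)
embedding_dim = 8
embedding_table = np.random.normal(size=(len(vocab), embedding_dim))

def embed(char):
    return embedding_table[char_to_idx[char]]

print(one_hot("a").shape)  # (39,) -- one slot per vocabulary item
print(embed("a").shape)    # (8,)  -- compact, learnable representation
```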

Let’s talk training data. Alas, I didn’t (and still don’t) have any labelled, curated, real-world address data. Instead, my approach was to synthetically generate as many random permutations as possible (some just shy of unintelligible) while staying within the realm of the plausible. There is a mish-mash of code that can be deciphered to understand the precise ways I butchered the GNAF dataset, but in essence the code makes the following decisions for each record:

  1. Keep or drop the suburb?
  2. Keep or drop the state?
  3. Abbreviate the state? (e.g. changing “Victoria” to “VIC”)
  4. Keep or drop the postcode?
  5. Keep or drop the street number?
  6. Abbreviate the street type? (e.g. changing “Road” to “RD”)
  7. Keep or drop the unit type?
  8. Convert level numbers to ordinal or cardinal words? (e.g. changing “1” to “1st” or “first”)

The components are then shuffled in a special way (keeping things like the state and postcode nearer to the end, for example), and then each component is subjected to some random corruption designed to be somewhat like the typos a human might make. Typos include:

  1. Character substitution (using nearby characters on the keyboard)
  2. Deletion
  3. Transposition
  4. Duplication

Then finally, each component is stitched together using some random delimiters. These might be spaces, dashes, slashes or full-stops (depending on the part of the address).
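Roughly, that generation pipeline looks something like the sketch below. The function names, probabilities and lookup tables are all illustrative stand-ins rather than the actual AddressNet code, and only a subset of the decisions listed above is shown.

```python
import random

# Partial, illustrative lookup tables (placeholders, not the real ones)
STATE_ABBREVIATIONS = {"VICTORIA": "VIC", "NEW SOUTH WALES": "NSW", "QUEENSLAND": "QLD"}
KEYBOARD_NEIGHBOURS = {"a": "qwsz", "s": "awedxz", "d": "serfcx"}

def corrupt(text, p=0.02):
    """Randomly introduce keyboard-ish typos: substitution, deletion,
    transposition and duplication."""
    chars, out, i = list(text), [], 0
    while i < len(chars):
        c = chars[i]
        if random.random() < p:
            choice = random.choice(["substitute", "delete", "transpose", "duplicate"])
            if choice == "substitute" and c.lower() in KEYBOARD_NEIGHBOURS:
                out.append(random.choice(KEYBOARD_NEIGHBOURS[c.lower()]))  # nearby key
            elif choice == "delete":
                pass                                   # drop the character entirely
            elif choice == "transpose" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])          # swap with the next character
                i += 1
            elif choice == "duplicate":
                out.extend([c, c])                     # type it twice
            else:
                out.append(c)
        else:
            out.append(c)
        i += 1
    return "".join(out)

def synthesise(record):
    """Turn one structured GNAF-style record (a dict) into a noisy free-text
    address, making random keep/drop/abbreviate decisions along the way."""
    parts = []
    if record.get("flat_number") and random.random() < 0.8:
        parts.append(f"{record['flat_number']}/{record['number_first']}")
    else:
        parts.append(str(record["number_first"]))
    parts.append(record["street_name"])
    parts.append(record["street_type"])                # could also be abbreviated, e.g. "ROAD" -> "RD"
    if random.random() < 0.9:                          # sometimes drop the suburb
        parts.append(record["locality_name"])
    if random.random() < 0.9:                          # sometimes drop or abbreviate the state
        state = record["state"]
        parts.append(STATE_ABBREVIATIONS.get(state, state) if random.random() < 0.7 else state)
    if random.random() < 0.9:                          # sometimes drop the postcode
        parts.append(str(record["postcode"]))
    delimiter = random.choice([" ", ", ", " - "])      # in the real pipeline this varies per component
    return corrupt(delimiter.join(parts))

record = {"flat_number": 3, "number_first": 45, "street_name": "JINDABYNE",
          "street_type": "ROAD", "locality_name": "OAKLEIGH",
          "state": "VICTORIA", "postcode": 3166}
print(synthesise(record))  # e.g. "3/45, JINDABYNE, ROAD, OAKLEIGH, VIC, 3166", with the odd typo
```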

But whyyyyy? Well, there are two reasons: first, these errors are likely to occur in real-world data, and the model should be robust enough to handle them. Second, this extra noise is intended to function as a form of regularisation, which is a term for methods that reduce the likelihood that a model overfits.

Properly mutating and corrupting the data was a high priority! While developing this model, I was acutely aware that I had no test data; there was no a priori way for me to know how well the model would perform in the real world. Ideally I would have a fully-labelled test set from manually entered addresses, however I am but one person working on a spare-time project. Thus it was key for me to ensure that there was so much variability in the input data that the model complexity could not start to “learn the noise.” Experience with the trained model suggests that I hit the mark, but I would love you, the reader, to give it a go and let me know how it goes. I know, I know, shame on me for having no test data 😔.

Ready to try it out? I’m sure you said yes. Just pip install it and see what you think: https://github.com/jasonrig/address-net

Contributions, feedback, corrections and suggestions are welcome. Let me know if you use AddressNet in, or adapt it for, your own project. May you never need a regular expression for your address data. 🙏
