
Filtering Tweets by Location

Account Location metadata + Regular Expressions == Tweet Location Filtering

In my latest project, I explored the question, "What is the public sentiment in the United States on K-12 learning during the COVID-19 pandemic?" Using data collected from Twitter, Natural Language Processing, and Supervised Machine Learning, I created a text classifier to predict Tweets’ sentiment on this topic.

Since I wanted to home in on sentiment in the United States, I needed to filter Tweets by location. The Twitter Developer site offers some good guidance on the available options.

I chose to use the Account Location geographical metadata. Here are the details from the Twitter Developer website: "Based on the ‘home’ location provided by the user in their public profile. This is a free-form character field and may or may not contain metadata that can be geo-referenced".

Before we begin, here are a few caveats:

  • Since Account Location is not guaranteed to be populated, you have to accept the fact that you’ll potentially miss out on relevant Tweets.
  • The approach I used depends on Account Location containing a US identifier (such as a valid US state name).

If those caveats are acceptable to you, keep reading. 🙂

Before I share the actual code, here’s a rundown of my methodology:

  • I created a Python script to listen to the Twitter Stream for pertinent Tweets on my subject(s) of interest.
  • When a Tweet arrives, I read its Account Location attribute and use regular expressions to check whether it’s a US location.
  • Tweets that pass the location check carry on through my code for further processing. (In my case, I stored the Tweet in a MongoDB collection.)
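
The three steps above can be sketched without any Twitter or MongoDB dependency. This is a minimal illustration, not the script itself: `handle_tweet`, `toy_check`, and the in-memory `kept` list are hypothetical names, tweets are assumed to be plain dicts with a nullable `user.location` field, and the list stands in for the MongoDB collection.

```python
from typing import Callable, List


def handle_tweet(tweet: dict, is_us_location: Callable[[str], bool], store: List[dict]) -> bool:
    """Keep the tweet only if its Account Location passes the supplied check."""
    # Step 2: read Account Location, which may be missing, None, or empty
    location = (tweet.get("user") or {}).get("location")
    if not location or not is_us_location(location):
        return False
    # Step 3: hand the tweet on for further processing (here: an in-memory list)
    store.append(tweet)
    return True


def toy_check(location: str) -> bool:
    # Toy stand-in for the real regex check, for illustration only
    return "USA" in location.upper()


kept: List[dict] = []
handle_tweet({"text": "a", "user": {"location": "Austin, usa"}}, toy_check, kept)
handle_tweet({"text": "b", "user": {"location": None}}, toy_check, kept)
handle_tweet({"text": "c", "user": {"location": "Paris, France"}}, toy_check, kept)
print(len(kept))
```

Passing the location check in as a function keeps the flow separate from the matching rules, so either can change without touching the other.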

Now that I’ve shared the overall workflow, let’s look at some code.

My script uses a tweepy.StreamListener to listen for incoming Tweets off the Twitter Stream. The magic happens in my StreamListener’s on_status function.

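import re

def on_status(self, tweet):
    # Check the tweet for a likely USA location
    if self.get_is_usa_loc(tweet):
        # Store the raw Tweet payload as the document (assumed document format)
        tweet_doc = tweet._json
        # insert_one is the current PyMongo call for a single document
        twitter_collection.insert_one(tweet_doc)
    else:
        print('Could not determine USA location')

def get_is_usa_loc(self, tweet):
    # Account Location is optional, so a missing value counts as "not USA"
    if not tweet.user.location:
        return False
    user_loc_str = str(tweet.user.location).upper()
    # A match on either regex is enough; bool() turns the Match/None into True/False
    return bool(re.search(usa_states_regex, user_loc_str)
                or re.search(usa_states_fullname_regex, user_loc_str))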

Within on_status, I call a helper function: get_is_usa_loc. In this helper function, I first check that Account Location is actually populated in the Tweet. If so, I convert Account Location to an uppercase string and check it against two regular expressions. One regex checks for two-character US state abbreviations. The other regex checks for full US state names. The Account Location only has to match one. 🙂
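
To try this logic outside of a stream listener, the same check can be written as a standalone function that takes a plain string. This is a sketch under assumptions: `is_usa_location` and `demo_patterns` are hypothetical names, and the two patterns are deliberately shortened stand-ins covering only three states, not the full expressions.

```python
import re
from typing import Iterable, Optional


def is_usa_location(location: Optional[str], patterns: Iterable[re.Pattern]) -> bool:
    """Return True if the uppercased location matches at least one pattern."""
    # An unpopulated Account Location (None or "") can never match
    if not location:
        return False
    normalized = str(location).upper()
    # Matching any single pattern is sufficient
    return any(p.search(normalized) for p in patterns)


# Shortened stand-in patterns, for illustration only
demo_patterns = [
    re.compile(r",\s(NY|TX|CA)"),                      # ", XX" state abbreviations
    re.compile(r"(NEW\sYORK|TEXAS|CALIFORNIA|USA)"),   # full names and 'USA'
]

for sample in ["Brooklyn, ny", "somewhere in texas", "Berlin", None]:
    print(repr(sample), is_usa_location(sample, demo_patterns))
```

Because the function only depends on a string and a list of compiled patterns, it can be unit-tested without a Twitter connection.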

Here are the regular expressions I used. The first is for state abbreviations, and the second is for full state names and the term ‘USA’.

usa_states_regex = r',\s{1}(A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])'

usa_states_fullname_regex = (
    r'(ALABAMA|ALASKA|ARIZONA|ARKANSAS|CALIFORNIA|'
    r'COLORADO|CONNECTICUT|DELAWARE|FLORIDA|GEORGIA|HAWAII|'
    r'IDAHO|ILLINOIS|INDIANA|IOWA|KANSAS|KENTUCKY|'
    r'LOUISIANA|MAINE|MARYLAND|MASSACHUSETTS|MICHIGAN|'
    r'MINNESOTA|MISSISSIPPI|MISSOURI|MONTANA|'
    r'NEBRASKA|NEVADA|NEW\sHAMPSHIRE|NEW\sJERSEY|'
    r'NEW\sMEXICO|NEW\sYORK|NORTH\sCAROLINA|'
    r'NORTH\sDAKOTA|OHIO|OKLAHOMA|OREGON|PENNSYLVANIA|'
    r'RHODE\sISLAND|SOUTH\sCAROLINA|SOUTH\sDAKOTA|'
    r'TENNESSEE|TEXAS|UTAH|VERMONT|VIRGINIA|'
    r'WASHINGTON|WEST\sVIRGINIA|WISCONSIN|WYOMING|USA)'
)
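
One behavior worth checking before relying on the abbreviation pattern: since it has no end anchor, any location whose text after the comma merely starts with a state code will match. The sketch below assumes the pattern is a comma, one whitespace character, and a state code; `strict_abbrev` (with a trailing `\b` word boundary) and the sample strings are hypothetical additions for comparison, not part of the original script.

```python
import re

STATE_CODES = (
    r"A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|"
    r"N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY]"
)

# Pattern in the article's form: comma, one whitespace character, state code
article_abbrev = re.compile(r",\s{1}(" + STATE_CODES + r")")

# Hypothetical stricter variant: \b requires the code to end at a word boundary
strict_abbrev = re.compile(r",\s(" + STATE_CODES + r")\b")

samples = ["AUSTIN, TX", "TORONTO, CANADA", "LONDON, UK"]

# Map each sample to (matches article pattern, matches strict pattern)
results = {
    s: (bool(article_abbrev.search(s)), bool(strict_abbrev.search(s)))
    for s in samples
}

for s, (loose, strict) in results.items():
    print(f"{s!r}: article={loose} strict={strict}")
```

With these samples, "TORONTO, CANADA" matches the unanchored pattern (via ", CA") but not the stricter one, while "AUSTIN, TX" matches both. Whether the extra strictness is worth it depends on how much non-US noise you see in your own data.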

And that’s it! I’ve just walked you through my process of filtering Twitter Data by Account Location using some straightforward Python code along with regular expressions.

If you’ve found this helpful, I encourage you to expand on this example in your own work. Please feel free to reach out with suggestions for improvement or let me know if this has helped you.

Happy filtering!

