Hands-on Tutorials

Data Analysis: Everything You’ve Ever Wanted to Know about UFO Sightings

A tidyverse breakdown of the NUFORC UFO sighting data

Travis Greene
Towards Data Science
19 min readJul 3, 2018

--

Photo by Jaredd Craig on Unsplash

If the words data and aliens interest you, you’re in the right place. In this post, we dig into 80,000+ NUFORC (National UFO Reporting Center) UFO sighting reports. In the course of this investigation, we’ll be using a host of methods in R, from treemaps, lollipop charts, and network diagrams, to geographical maps and even a couple of statistical tests.

What might we learn from this? At the very least, if we want to maximize our chances of seeing a UFO, we might learn where and when to look for one. You might even learn some nifty R tricks for cleaning and visualizing your data. You can access the data here at NUFORC.

My UFO Story

Ever since I saw a UFO in the middle of broad daylight in the 3rd grade, I’ve been interested in aliens and UFOs. Here’s what happened.

It was just another Sunday afternoon at the local park in La Mesa, CA. The year must have been around 1993 or so. As my buddy and I were skateboarding, something in the sky caught my eye. We looked up to watch a shiny, silver disk hovering effortlessly and silently in the blue sky, shooting across the lengths of entire clouds instantaneously. I don’t mean it moved quickly. I mean it did not move — it simply appeared at different locations. The best way I can describe it would be to imagine taking a pencil and poking holes in a two-dimensional sheet of paper. If you were facing the paper, unaware of the pencil moving behind the paper, the holes would appear “instantaneously.” In reality, the pencil is simply moving in another “hidden,” third dimension.

The UFO was not far from us, either, at a height you might see a helicopter. I distinctly remember the way the sun glared off the metallic exterior of the craft — it was just like the UFOs in the movies and the ones you hear about on your local news. Shocked, we ran and told our confused but intrigued parents, who then came and watched with us for another ten minutes or so. All in all, I believe we had at least five witnesses. Eventually, the UFO shot up into the atmosphere at a hyper-drive-like speed that can only be described as physically impossible. It simply vanished upwards, like a piece of dust being vacuumed up a tube.

Can Data Analysis Help us Figure Out the Mysteries of UFOs?

Ever since then, I’ve tried to figure out what exactly UFOs might be. Could a government be so stupid as to test secret, highly-advanced technology in the middle of the day in San Diego? Not likely. Let’s see if we can find out anything that might help us to decide whether UFOs are indeed human or alien.

Data Clean Up and Counting

library(tidyverse)
library(tidytext)
library(ggmap)
library(stringr)
df <- read.csv('ufo.csv', stringsAsFactors = FALSE)

#delete any cities with punctuation. focus on US
#if you put df[which...] then you get a data frame back. we just want vector of indices
bad <- which(str_detect(df$city, '[[:punct:]]'))

df <- df[-bad,]

df %>%
count(city, state, shape)%>%
arrange(desc(n))%>%
head()
## # A tibble: 6 x 4
## city state shape n
## <chr> <chr> <chr> <int>
## 1 seattle wa light 113
## 2 phoenix az light 90
## 3 san diego ca light 78
## 4 portland or light 77
## 5 las vegas nv light 68
## 6 los angeles ca light 63

Now we can use the great tidytext package to dive into the reports and see what we can find in the text descriptions.

big <- df %>%
unnest_tokens(word, comments, token='ngrams', n=2)%>%
anti_join(stop_words)%>%
count(word, sort=TRUE)
#solve problem of bigrams with stop words
bigrams_separated <- big %>%
separate(word, c("word1", "word2"), sep = " ")

#keep only alphabetical words and longer than 2 letters
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)%>%
filter(str_detect(word1, '[[:alpha:]]'))%>%
filter(str_detect(word2, '[[:alpha:]]'))%>%
filter(nchar(word1) > 2)%>%
filter(nchar(word2) > 2)%>%
filter(word1 != 'ufo')%>%
filter(word2 != 'ufo')

#most common types of lights seen
lights <- bigrams_filtered %>%
filter(word2 == 'light' | word2 == 'lights')%>%
unite('bigram', -n, sep=' ')

#What type of shapes?
shapes <- bigrams_filtered %>%
filter(word2 == 'shape' | word2 == 'shaped')%>%
unite('bigram', -n, sep=' ')

#movement
mvt <- bigrams_filtered %>%
filter(word2 =='movement' | word2 == 'movements')%>%
unite('bigram', -n, sep=' ')

speed <- bigrams_filtered %>%
filter(word2 == 'speed' | word2 == 'speeds')%>%
unite('bigram', -n, sep=' ')

Visualizing UFO Characteristics by Treemap

Treemaps work by making the area of the rectangles proportional to some variable in our dataframe. In this case, we have counts of the bigrams, so we will set the size of the rectangles to reflect the count of each pair. Hopefully this will tell us something interesting about the most common characteristics of UFOs.

# treemap. the size of the box is proportional to its count among all lights
treemap(speed,
index="bigram",
vSize="n",
type="index",
fontsize.labels = 6,
title= 'UFO Speed Words'
)
Image by Author

As I expected, most people say that UFOs move at incredibly fast speeds. I like the description of “warp speed.” I’d say the UFO I saw also moved at a laws-of-physics-defyingly fast speed.

Image by Author

It seems like there’s a definite pattern of UFOs having erratic and unusual kinds of movements.

treemap(shapes,
index="bigram",
vSize="n",
type="index",
fontsize.labels = 6,
title= 'UFO shape Words'
)
Image by Author

Interestingly lots of cigar-shaped and triangle shaped UFOs. I’ve never heard of Chevron/boomerang shaped UFOs before.

treemap(lights,
index="bigram",
vSize="n",
type="index",
fontsize.labels = 6,
title= 'UFO Lights Words'
)
Image by Author

It looks like most of the lights seen are either white, orange, red, or blue lights.

Visualizing States and Cities by Lollipop chart

Another way to represent the same information is a lollipop chart. These are essentially slicker-looking versions of bar charts, where the length of the bar is proportional to some column in our dataframe. Remember that in ggplot2, we use the aes() argument to map variables (i.e., columns) onto our plots. This is an extremely useful aspect of ggplot.

#city state counts of sightings
state_counts <- df %>%
filter(state != '')%>%
count(city, state, sort=TRUE)%>%
unite('location', -n, sep=',')

#visualize with lollipop chart
state_counts %>%
filter(n > 90)%>%
mutate(x = factor(location))%>%
ggplot(aes(x, n))+
geom_segment(aes(x=reorder(x,n), xend=x, y=0, yend=n), size=1)+
geom_point(color='red', size=2, alpha=.6)+
coord_flip()+
theme_minimal() +
theme(axis.text.y = element_text(size = 7))+
labs(title='Which cities have had the most sightings?')
Image by Author
#count cities and shape
#seattle fireballs sounds like a basketball team
#get rid of others because they don't contain useful info
df %>%
filter(!shape %in% c('', 'unknown', 'other', 'light'))%>%
count(city, shape, sort=TRUE)%>%
filter(n > 20)%>%
unite('type', -n, sep=' ')%>%
mutate(type = factor(type))%>%
ggplot(aes(type, n))+
geom_segment(aes(x=reorder(type,n), xend=type, y=0, yend=n), size=1)+
geom_point(color='red', size=2, alpha=.6)+
coord_flip()+
labs(title='What shapes are most commonly seen in each location?',
x = 'City and Type of Shape', y='# of sightings')+
theme_minimal()
Image by Author

Using lubridate to find patterns over time

The lubridate package is great. It makes working with dates and times easy. If you’re familiar with xts or ts objects, you’ll know that working with dates and times in R can be cumbersome.

One really handy function is floor_date(), which allows you to essentially aggregate observations into buckets of time. It’s similar to how you might choose your bin size in a histogram, but here we are choosing what interval of time to consider a ‘bin’ or ‘bucket.’ As you change the interval (i.e., unit of time), you’ll see different patterns emerge as the zoom level of the data changes. You’ll need to use your judgment to figure out which level of aggregation best fits your needs.

library(lubridate)

#biggest single events reported one time tinley park 17 counts
df$datetime <- mdy_hm(df$datetime)

#1 year intervals
df %>%
mutate(datetime = floor_date(datetime, unit='1 year'))%>%
group_by(datetime)%>%
filter(datetime > '1939-01-01')%>%
summarize(cts = n())%>%
ggplot(aes(datetime, cts))+
geom_line()+
scale_x_datetime(date_breaks = '5 years', date_labels = '%Y') + theme(plot.subtitle = element_text(vjust = 1),
plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) +labs(title = "UFO Sightings Since 1939",
x = "Date", y = "Sightings")
#which states have the most sightings?
#ca, wa, fl, tx, ny
df %>%
count(state, sort=T)%>%
slice(1:10)

## # A tibble: 10 x 2
## state n
## <chr> <int>
## 1 ca 8998
## 2 wa 3949
## 3 fl 3792
## 4 tx 3445
## 5 ny 2635
## 6 il 2460
## 7 az 2454
## 8 pa 2406
## 9 oh 2301
## 10 mi 1930

California, Washington and Florida see the most UFOs. Not surprising given their large populations.

Image by Author
#top 10 states by sightings
df %>%
mutate(datetime = floor_date(datetime, unit='1 year'))%>%
group_by(datetime, state)%>%
filter(datetime > '1950-01-01')%>%
filter(state %in% c('ca', 'wa', 'fl', 'tx','ny', 'il','az','pa','oh','mi'))%>%
summarize(cts = n())%>%
ggplot(aes(datetime, cts, color=state), alpha=.5)+
geom_line(size=1)+
scale_x_datetime(date_breaks = '5 years', date_labels = '%Y') +
theme(plot.subtitle = element_text(vjust = 1),
plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) +labs(title = "UFO Sightings Since 1950",
x = "Date", y = "Sightings")
Image by Author

California has dominated with sightings, though Washington took over in the mid 1990s for just a year or so. What about which cities have seen the most UFOs?

#which cities?
cities <- df %>%
count(city, sort=T)%>%
slice(1:10)

cities <- cities[,1]
cities

## # A tibble: 10 x 1
## city
## <chr>
## 1 seattle
## 2 phoenix
## 3 portland
## 4 las vegas
## 5 los angeles
## 6 san diego
## 7 houston
## 8 chicago
## 9 tucson
## 10 miami

How have UFO sightings changed over time in the most popular UFO locations?

#tracking 10 cities
df %>%
mutate(datetime = floor_date(datetime, unit='5 years'))%>%
group_by(datetime, city)%>%
filter(datetime > '1970-01-01')%>%
filter(city %in% c('seattle', 'phoenix','portland', 'san diego', 'los angeles', 'houston', 'las vegas', 'chicago', 'tucson', 'miami'))%>%
summarize(cts = n())%>%
ggplot(aes(datetime, cts, color=city), alpha=.8)+
geom_line(size=1)+
scale_x_datetime(date_breaks = '5 years', date_labels = '%Y') +
theme(plot.subtitle = element_text(vjust = 1),
plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) +labs(title = "5 year aggregate UFO Sightings Since 1970",
x = "Date", y = "Sightings")
Image by Author

We can see that all of the top cities have experienced roughly the same increase in sightings over time. Seattle and Phoenix stand out as having the most in a single year. Strangely sightings seem to have decreased since 2008.

How have UFO shapes changed over time? Are cigar-shaped UFOs as common as disk-shaped ones?

#how have shapes changed over time? 1 year aggregrates
df %>%
mutate(datetime = floor_date(datetime, unit='1 year'))%>%
filter(datetime > '1950-01-01')%>%
group_by(datetime, shape)%>%
filter(!shape %in% c('', 'changed', 'changing', 'other', 'unknown'))%>%
summarize(cts = n())%>%
ggplot(aes(datetime, cts, color=shape))+
geom_line(size=1, alpha=.6)+
scale_x_datetime(date_breaks = '5 years', date_labels = '%Y')+
theme_minimal()
Image by Author

Looks like lights are the most common now, though it wasn’t always that way. Disks dominated back in the day.

Instead of counts, let’s look at the proportion of each shape sighted:

#5 year intervals proportion of shapes sighted
df %>%
mutate(datetime = floor_date(datetime, unit='5 years'))%>%
filter(datetime > '1950-01-01')%>%
group_by(datetime, shape)%>%
filter(!shape %in% c('', 'changed', 'changing', 'other', 'unknown'))%>%
summarize(cts = n())%>%
mutate(freq = cts/sum(cts))%>%
filter(freq > .05)%>%
ggplot(aes(datetime, freq, color=shape))+
geom_line(size=1)+
scale_x_datetime(date_breaks = '5 years', date_labels = '%Y')+
theme_minimal()+
labs(title='Freq of shape in 5 year aggregate sightings since 1950')
Image by Author

Since the 1950s disks have gone down massively in popularity, while triangles rose to the top in the late 1980s, then were finally overtaken by lights. Spheres and circles have remained stable at around 10% of all sightings.

How have shapes changed in California over the past 50 years?

df %>%
mutate(datetime = floor_date(datetime, unit='10 years'))%>%
filter(datetime > '1950-01-01')%>%
group_by(datetime, shape, state)%>%
filter(state == 'ca')%>%
filter(!shape %in% c('', 'changed', 'changing', 'other', 'unknown'))%>%
summarize(cts = n())%>%
ungroup()%>%
group_by(datetime, add=TRUE)%>%
mutate(freq = cts/sum(cts))%>%
filter(freq > .05)%>%
ggplot(aes(datetime, freq, color=shape))+
geom_line(size=1, alpha=.6)+
scale_x_datetime(date_breaks = '20 years', date_labels = '%Y')+
facet_wrap(~ state)+
theme_minimal()+
labs(title='Freq of shape in 10 year aggregate sightings since 1950')
Image by Author

Takeaways

  1. Disks peaked in the 1950s and hardly ever appear now.
  2. Lights were rare until the late 1970s and then exploded.
  3. Triangles seem to have peaked in the late 1980s.
  4. Fireballs steadily grew in popularity until the late 1980s and early 1990s, then took a small hit and recovered. Maybe as a result of the UFO abduction movie Fire in the Sky?
  5. Cigars died out in the 1960s. Aliens’ leases expired and they traded up for lights?

Are UFO sightings seasonal?

Are there daily patterns to sightings? What time of day is best to maximize one’s chances of seeing a UFO? We’ll use the super useful functions from lubridate that extract the different pieces of time out the datetime objects we created. Once we mutate the new columns it’s super easy to groupby() on these times and make nice visuals.

df <- df %>%
mutate(day = day(datetime),
month = month(datetime),
year = year(datetime),
hour = hour(datetime))

Now we are ready to start grouping and visualizing.

#I'm only showing 24 years' of data here due to size limits. But the same pattern persists up till 2014.
df %>%
mutate(month = factor(month),
day = factor(day))%>%
filter(between(year, 1950, 1974))%>%
group_by(year, month)%>%
summarize(cts = n())%>%
ggplot(aes(month, cts, group=year))+
geom_line()+
facet_wrap(~ year, ncol = 5, scales = 'free_y')+
labs(title='Are UFO sightings seasonal? Apparently yes')
Image by Author

This was really unexpected: UFOs are extremely seasonal! Nearly every year shows the same distribution of sightings through the months with a big peak between June and July in almost every year. What could possibly explain that?

What about time of day? Do more sightings happen in the morning or evening?

df %>%
group_by(day, hour)%>%
summarize(cts = n())%>%
ggplot(aes(hour, cts, group=day))+
geom_line()+
facet_wrap(~ day)+
labs(title='Does time of UFO sighting depend on the day of the month?')
Image by Author

I’m surprised by the fact the most sightings are happening around 9–11 p.m. Of course at night our vision isn’t good and there are so many other things in the sky competing for our attention.

Also, look at the two peaks visible in the chart above: first, the big peak on the 4th and the big peak around midnight on the 1st. We’ll explain these shortly.

I realize now that my daytime sighting was a very rare event. How rare? Let’s estimate the probability of seeing a UFO at 3 p.m.

df %>%
count(hour)%>%
mutate(freq = n/sum(n),
cum_prob = cumsum(freq))%>%
slice(15:17)
## # A tibble: 3 x 4
## hour n freq cum_prob
## <int> <int> <dbl> <dbl>
## 1 14 1042 0.01588608 0.3131784
## 2 15 1147 0.01748689 0.3306653
## 3 16 1282 0.01954507 0.3502104
#plot the estimated cumulative probability of sightings by hour in the day
df %>%
count(hour)%>%
mutate(freq = n/sum(n),
cum_prob = cumsum(freq))%>%
ggplot(aes(hour, cum_prob))+
geom_area(alpha = .5)+
geom_vline(xintercept = 15, color='red')+
geom_hline(yintercept = .33, color='red')+
labs(title='Estimated cumulative probability of UFO sightings by hour in the day')
Image by Author

Apparently only about 1.7% of all sightings occur between 3–4 p.m. We can also see that only about 33% of sightings occur between midnight and 3 p.m. The vast majority happen between 8–10 p.m. More specifically, nearly 50% of sightings (actually I should say reports of sightings) occur between 8 p.m. and midnight.

What about the day of the week? Do more sightings happen on the weekends? I saw mine on a Sunday afternoon. We’ll use lubridate’s useful wday() to get the name of the day.

library(stringr)
df %>%
mutate(dow = wday(datetime, label=T))%>%
count(month, dow)%>%
ggplot(aes(dow, n, fill=dow))+
geom_col()+
guides(fill=FALSE)
Image by Author

Saturdays and Sundays do indeed see a big uptick in sightings. But does this always hold in all months? If you’re familiar with Simpson’s Paradox, then you’d know that sometimes aggregate results can be completely at odds with results broken down into groups.

df %>%
mutate(dow = wday(datetime, label=T))%>%
count(month, dow)%>%
ggplot(aes(dow, n, fill=dow))+
geom_col()+
facet_wrap(~ month)+
guides(fill=FALSE)
Image by Author

As suspected, there are more sightings on the weekends. There is more time to stare into the sky and drink alcohol, all of which would explain the increase in sightings. The gradual ramping up of sightings through the week is similar to what you would see in retail sales. Monday is slow then shopping picks up through the week, with the biggest bumps on the weekends. This suggests to me that the sightings are more of a function of people’s weekly schedules/habits than any real alien phenomenon.

Is this weekend bump a random fluke? Let’s conduct a Chi-Squared significance test of the day of the week. We’ll assume as a null hypothesis that sightings are equally likely to occur on any day of the week.

cn <- df %>%
mutate(dow = wday(datetime, label=T))%>%
count(dow)%>%
ungroup()
cn

## # A tibble: 7 x 2
## dow n
## <ord> <int>
## 1 Sun 9533
## 2 Mon 8191
## 3 Tue 8856
## 4 Wed 9053
## 5 Thu 8989
## 6 Fri 9477
## 7 Sat 11493

#Goodness of fit test:
#Null hypothesis: Each day is equally likely for a UFO sighting
#What is the probability of getting the observed DOW counts we did under this condition?
chisq.test(cn$n)

##
## Chi-squared test for given probabilities
##
## data: cn$n
## X-squared = 687.82, df = 6, p-value < 2.2e-16

Yup, it’s almost certain that weekend sighting rates are substantially higher than weekday, assuming the null hypothesis of equal probability of occurrence.

Are there certain months where UFO sightings occur at different times?

df %>%
group_by(state)%>%
filter(state %in% c('ca', 'wa', 'tx', 'fl', 'ny'))%>%
count(hour,month)%>%
ggplot(aes(hour, n, color=state))+
geom_line()+
facet_wrap(~ month)+
theme_minimal()+
labs(title='Why does July have a big spike?')
Image by Author

This graphic shows us that for the top 5 states, all have similar UFO sighting times throughout the day. Yet in July we notice a spike coming from more sightings past 8 p.m. Could it have something to do with the Fourth of July holiday? More people are outside and looking at the sky for fireworks, maybe?

df %>%
group_by(state)%>%
filter(state %in% c('ca', 'wa', 'tx', 'fl', 'ny'))%>%
count(hour,month,day)%>%
ggplot(aes(day, n, color=state))+
geom_line(size=1.5)+
facet_wrap(~ month)+
labs(title='People apparently can\'t tell fireworks and UFOs apart')+
theme_minimal()
Image by Author

The Disappointing Conclusion: Fireworks Mistaken as UFOs

The two days with the biggest spikes in sightings are indeed the Fourth of July and New Year’s Eve, two days where tons of people shoot fireworks into the night sky. I was hoping there would be a much cooler reason for the mysterious spike in sightings during the middle of the year — but it turns out it’s likely just people mistaking fireworks for alien spacecraft.

The only cyclical pattern in UFO sightings is that of Americans’ weekly routines. I don’t see much evidence here that people see UFOs beyond what they would be expected to given their increased leisure time and consumption of alcohol on weekends and holidays.

With this section concluded, let’s go back to the textual descriptions and see if we can find anything of interest there.

Finding patterns in the UFO descriptions using network visualizations

The igraph and ggraph packages will help us here. And I highly recommend the free tidytext book by David Robinson and Julia Silge.

#need 3 things after have igraph object: node edges and labels
library(igraph)
library(ggraph)big_graph <- bigrams_filtered %>%
filter(n > 200)%>%
graph_from_data_frame()

#for some reason the edge_alpha argument isn't changing. It should make alpha levels proportional to the number of occurrences in the text.
a <- grid::arrow(type = "open", length = unit(.05, "inches"))

ggraph(big_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "red", size=2) +
geom_node_text(aes(label = name), vjust = 2, hjust = 1) +
theme_void()
Image by Author

This graph gives a nice overview of the major descriptions of the UFOs. The direction of the arrows indicate whether the word was the 1st or 2nd word in the bigram pair.

Geographic visualization of UFO hotspots

To conclude our analysis, let’s focus on geographic locations to see if we can find any “hotspots” or patterns in sightings.

In order to plot using latitude and longitude, we’ll follow the same strategy I outlined in my ggmaps tutorial. First, groupby() location, then average the latitude and longitude of each city. If you want to see how that was done, go here.

Note: I believe Google Maps has changed their API. Now you must sign up for access.

us_map <- get_map('USA', zoom = 3,maptype = 'satellite')%>%
ggmap()
us_map
Image by Author
#cities with over 50 sightings
over_50 <- df %>%
count(city,state)%>%
filter(n > 1)%>%
unite('location', -n, sep=',')
head(over_50)
## # A tibble: 6 x 2
## location n
## <chr> <int>
## 1 29 palms,ca 2
## 2 abbeville,la 4
## 3 abbeville,sc 2
## 4 aberdeen,md 5
## 5 aberdeen,nj 2
## 6 aberdeen,sd 2
#Now it's in the shape we need to plot. Not gunna lie: this took me like 10 mins to get working
lat_long <- df %>%
unite('location', c('city','state'), sep=',')%>%
filter(location %in% over_50$location)%>%
mutate(latitude = as.numeric(latitude),
longitude = as.numeric(longitude))%>%
group_by(location)%>%
summarize(lon = mean(longitude, na.rm=TRUE),
lat = mean(latitude, na.rm=TRUE),
cts = n())%>%
ungroup()
head(lat_long)
## # A tibble: 6 x 4
## location lon lat cts
## <chr> <dbl> <dbl> <int>
## 1 29 palms,ca -116.05417 34.13556 2
## 2 abbeville,la -92.13417 29.97444 4
## 3 abbeville,sc -82.37917 34.17806 2
## 4 aberdeen,md -76.16444 39.50944 5
## 5 aberdeen,nj -74.22212 40.41710 2
## 6 aberdeen,sd -98.48611 45.46472 2

We are ready to visualize by overlaying our dataframe on top of our country map.

us_map +
geom_point(data=lat_long, aes(lon, lat, color=cts, size=cts, alpha=cts))+
scale_color_gradient(low='blue', high='red')+
guides(color=FALSE)+
labs(title='US UFO sightings since 1944')
Image by Author

Besides most sightings happening in big cities on the coasts (where most of the people live), it is interesting to see a big vertical gap in the Midwest where almost no sightings occur. I wonder if geographical features (mountains/rivers) are playing a role. There’s almost a vertical line splitting the US into two halves.

Where are the UFOs in Kansas, Nebraska, South Dakota, and North Dakota? There also is a gap in sightings in the region that appears to be Northern Nevada and Idaho.

Can we zoom in to better see?

us_idaho <- get_map('Idaho, USA', zoom=5, maptype = 'satellite')%>%
ggmap()

us_idaho +
geom_point(data=lat_long, aes(lon, lat, color=cts, size=cts, alpha=cts))+
scale_color_gradient(low='blue', high='red')+
guides(color=FALSE)+
labs(title='Why no UFOs in Idaho?')
Image by Author

Clearly the bigger the population, the more opportunity for sightings. It could be the relatively sparsely populated areas that account for the lack of sightings. Or it could be that UFOs have an aversion to potatoes. I’m not sure we’ll ever know. For now, I’m sticking with the potato hypothesis.

Can we visualize key words in different regions?

In order to do this we will look at single words that are most common in each region.

library(ggrepel)words_map <- df %>%
unite('location', c('city','state'), sep=',')%>%
unnest_tokens(words, comments)%>%
filter(!words %in% stop_words$word) %>%
filter(!words %in% c('las', 'los', 'san', 'quot', 'diego', 'sky', 'angeles', 'object',
'light', 'lights', 'beach', 'jose', 'francisco', 'antonio',
'tinley', 'myrtle', 'salt', 'texas', 'bright',
'moving', 'monica', 'colorado', 'city', 'barbara','flying',
'shaped', 'shape', 'santa', 'object', 'objects', 'craft',
'moved', 'alaska', 'downtown', 'north', 'south', 'east', 'west',
'rapids','sighting', 'cajon', 'simi', 'boca', 'paso',
'lauderdale', 'grand', 'puget', 'nuforc', '39s',
'looked', 'nyc', 'obj', 'cruz', 'missouri','springs', 'note',
'appeared', 'hotel', 'night', 'park', 'red', 'palm',
'des', 'moines'))%>%
filter(str_detect(words, '[[:alpha:]]'))%>%
filter(nchar(words) > 2)%>%
filter(words != 'ufo')%>%
filter(!words %in% df$city)%>%
count(location, words, sort=T)%>%
filter(n > 12)%>%
inner_join(lat_long, by='location')
head(words_map)
## # A tibble: 6 x 6
## location words n lon lat cts
## <chr> <chr> <int> <dbl> <dbl> <int>
## 1 seattle,wa sound 37 -122.3308 47.60639 524
## 2 seattle,wa green 35 -122.3308 47.60639 524
## 3 seattle,wa hovering 27 -122.3308 47.60639 524
## 4 phoenix,az formation 24 -112.0733 33.44833 450
## 5 seattle,wa fast 24 -122.3308 47.60639 524
## 6 las vegas,nv green 23 -115.1364 36.17500 363
us_map +
geom_point(data=words_map, aes(lon, lat, alpha=n, size=n), color='red')+
geom_text_repel(data = words_map, aes(label = words, size=n), color='green')+
labs(title='Key words from UFO sightings around the US')
Image by Author

Recapping the results

First, you should note that I removed the word “red” and “night” to make room for other, possibly more meaningful words. So if you see some red stuff at night, there’s a good chance it’s a UFO.

Second, we see that fast glowing fireballs seem more common around Seattle. Though in Southern Texas there also seems to be many green fireballs. By the way, the Seattle Fireballs would make an awesome NBA team name. Just saying.

It is also interesting to see which areas witness UFO formations. It looks like the areas near Lake Havasu, CA and Phoenix, AZ tend to see more UFO formations.

Final theory to explain the lack of UFOs near Idaho…

Is it just a coincidence that the areas without many UFOs seem to be located in the general area where Area 51 is claimed to exist? Probably not. Most of the terrain is just sand and mountains with very few people around to spot a UFO. I would be very surprised if we found a location with a very small population but a lot of UFO sightings. That would indicate something abnormal is happening there.

m <- get_map('Groom Lake, Nevada', zoom = 6, maptype='terrain')%>%
ggmap()

m + geom_point(data=lat_long, aes(lon, lat, color=cts, size=cts))+
scale_color_gradient(low='blue', high='red')+
guides(color=FALSE)+
labs(title='Sightings surrounding Area 51')
Image by Author

Preliminary Conclusions

Based on this graphic there don’t seem to be any unpopulated areas with high amounts of UFOs: UFO reports are basically a function of having more people around to make UFO reports — not any underlying increase in unidentified flying objects. This goes to show you the limits of pure data analysis without causal knowledge: garbage in, garbage out. I would say that probably a very high percentage of these reports are false positives.

Once we figured out that 4th of July and New Year’s Eve fireworks were responsible for the big spikes in reports, I lost a lot of faith in the quality of the reports. The fact that many more are reported on weekends is also dubious. Most of the reports are probably just a result of people with more free time on their hands. Also, given the huge numbers of mere “lights” that people see, there are just so many possible things they could be.

It’s also troubling that there are almost no more sightings of “cigar-shaped” or ‘disk-shaped’ flying saucers anymore. For my money, I would only trust sightings that occur in daylight and under good, clear weather conditions.

The truth, apparently, is still out there.

--

--