Mapping South America with R: A Deep Dive into Geo-Visualization

Navigating datasets, geopolitical nuances, and coding challenges to paint a comprehensive picture of the continent

Fernando Barbalho
Towards Data Science

--

Photo by Alexander Schimmeck on Unsplash

So you are that kind of data scientist and amateur Medium writer who has enjoyed maps and geography since childhood. You are searching for a good theme for your subsequent work with graphs and, most certainly, maps when you realize that the official statistics agency of your country, Brazil, released the most recent census data. Why not? Why not take a picture of Brazil compared to its neighbors in South America? It might be a simple task using R and all its good packages. Let’s do it.

The minute after this decision comes the realization that the simple task is indeed a hero’s journey with elements such as discovering the most suitable dataset with shapefiles, lack of information, shapefiles interoperability, latitude and longitude mathematics, cultural differences in Geography concepts, and even geopolitical issues, like understanding how to put French overseas territories’ map and data correctly in South America.

The next paragraphs explain one of some possible paths to paint demographic information in a delimited portion of a world map. The step-by-step described below might be useful for all those interested in international comparison with a Geo-visualization approach, even if one’s purpose is to compare access to water among African countries or obesity rates in North America.

Pachamama

Let’s start with the whole picture: an R version of mapa-mundi. See the image and code below.

Mapa Mundi: Image by author
library(readxl)
library(geobr)
library(tidyverse)
library(sf)
library(spData)
library(ggrepel)
library(colorspace)

data("world")


#mapa mundi


world %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
scale_fill_continuous_sequential(palette= "Heat 2" )+
theme_void() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões de habitantes", 10)
)

I use the package {spData} as a reference for a dataframe with geometry information for territories shapefiles all around the planet. The aes function uses the population information to fill the shapes. As we know, China and India are the most populated countries in the world, with over 1 billion people each. The heat colors show the contrast with all other countries. Most of the sequential colors are weak. We can barely understand the gradient of colors in the picture. The logarithm is the best alternative if you want a glimpse of a better color distribution. See below.

Mapa mundi with log scale. Image by author
world %>%
ggplot() +
geom_sf(aes(fill=pop)) +
scale_fill_continuous_sequential(palette= "Heat 2", trans= "log2" )+
theme_void() +
theme(
panel.background = element_rect(fill="#0077be"),
legend.position = "none"
)

In the code, you can see the logarithm transformation in the scale_fill_continuous_sequential function.

In the world dataframe structure, there is a Continent column. So, filtering the data using this column to get a South American map is obvious. See the code and, right after, the map.

South America map: first version. Image by author
world %>%
filter(continent == "South America") %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
scale_fill_continuous_sequential(palette= "Heat 2" )+
theme_void() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões de habitantes", 10)
)

As you can see, the dplyr filter function worked fine; this is just the map we wanted to see. But is it really correct?

Climate change is a huge issue, but the sea levels haven’t risen yet with such a volume to drown a pronounced area that used to appear in North of South America. What happened here? Let’s draw another map now with the help of coordinates and naming the polygons.

South America map: second version. Image by author
southamerica<-
world %>%
filter(continent=="South America")



southamerica$lon<- sf::st_coordinates(sf::st_centroid(southamerica$geom))[,1]
southamerica$lat<- sf::st_coordinates(sf::st_centroid(southamerica$geom))[,2]

southamerica %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
scale_fill_continuous_sequential(palette= "Heat 2" )+
theme_light() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões de habitantes", 10)
)+
geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)),
color = "black",
fontface = "bold",
size = 2.8)

The theme_light instead of theme_void was sufficient to display the coordinates. The polygon naming took more work. We had to calculate the centroid of each polygon and then use this information as x and y coordinates in a geom_text_repel function.

With this new map version and some previous knowledge, we discovered the missing territory was French Guyana, between 0º and 10º north latitude and 53º and 55º west longitude. Our next quest is understanding how to get pieces of information on French Guyana: polygon, population, and some coordinates to fill our map.

La Mer

I had to isolate France from the rest of the world to understand how the {spData} package dealt with this country map’s data. See the result below.

A map of France. Image by author
france<-
world %>%
filter(iso_a2 == "FR")


france %>%
ggplot() +
geom_sf(aes(fill=pop)) +
scale_fill_continuous_sequential(palette= "Heat 2", trans= "log2" )+
theme_light() +
theme(
#panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População", 30)
)

France has many so-called overseas territories. The approach of the {spData} package was to represent only the main territory, plus Corsica, an island in the Mediterranean Sea, and French Guyana, located precisely in the coordinate range that characterizes the gap in our last map of South America.

My next try was to add the dataframe with France geometry data to my South America filter, but I knew I would need more. See below

South America + France. Image by author
southamerica %>%
bind_rows(france) %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
scale_fill_continuous_sequential(palette= "Heat 2" )+
theme_light() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões de habitantes", 10)
)+
geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)),
color = "black",
fontface = "bold",
size = 2.8)

As you can see in the code, I used bind_row to combine South American territories with France shapefile. So we had now the French Guyana well positioned. On the other hand, there is no population information on the map, and France is a part of South America on the reverse of colonialism’s history.

In other words, I wanted this map.

French Guyana is on the South America map. Image by author
data_guiana<-
insee::get_idbank_list('TCRED-ESTIMATIONS-POPULATION') %>%
filter(str_detect(REF_AREA_label_fr,"Guyane")) %>%
filter(AGE == "00-") %>% #all ages
filter(SEXE == 0) %>% #men and women
pull(idbank) %>%
insee::get_insee_idbank() %>%
filter(TIME_PERIOD == "2023") %>%
select(TITLE_EN,OBS_VALUE) %>%
mutate(iso_a2 = "FR")

data_guiana <- janitor::clean_names(data_guiana)

southamerica %>%
bind_rows(france) %>%
left_join(data_guiana) %>%
mutate(pop=ifelse(iso_a2=="FR",obs_value,pop))%>%
mutate(lon= ifelse(iso_a2=="FR", france[[11]][[1]][[1]][[1]][1,1], lon),
lat= ifelse(iso_a2=="FR",france[[11]][[1]][[1]][[1]][1,2], lat)) %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
scale_fill_continuous_sequential(palette= "Heat 2" )+
geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)),
color = "black",
fontface = "bold",
size = 2.8)+
coord_sf(xlim = c(-82,-35), ylim=c(-60,15))+
theme_light() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões de habitantes", 10)
)

As you can read, I used an R package produced by France’s official statistics office to obtain the population of Guyana. In addition, I limited the map to the appropriate coordinates to see South America.

Emolduram e aquarelam o meu Brasil

Now that the map hero finally resolved the South American issues and played the pipes of peace with France, it’s time to return to Brazilian data and maps. Remember, I want to compare some Brazilian census details with other countries and territories south of Panama.

The data census is available in an R package or at an API address. I did the more challenging option using the API. Using the other option another time might be a good idea. See the code and the map below, where I show the population of the Brazilian states in contrast to the other South American territories.

South America + Brazilian states. Image by author
central_america<-
world %>%
filter(subregion == "Central America")



brasil<- geobr::read_country()
estados<- geobr::read_state()

#dados de população

ibge2022<-
get_municipalies_data()

estados<-
estados %>%
inner_join(
ibge2022 %>%
rename(abbrev_state = uf) %>%
summarise(.by=abbrev_state,
pop = sum(populacao_residente)
)
)

southamerica %>%
filter(iso_a2!="BR") %>%
bind_rows(france) %>%
left_join(data_guiana) %>%
mutate(pop=ifelse(iso_a2=="FR",obs_value,pop))%>%
mutate(lon= ifelse(iso_a2=="FR", france[[11]][[1]][[1]][[1]][1,1], lon),
lat= ifelse(iso_a2=="FR",france[[11]][[1]][[1]][[1]][1,2], lat)) %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
geom_sf(data=estados, aes(fill=pop/10^6)) +
geom_sf(data=brasil,fill=NA, color="#00A859", lwd=1.2)+
geom_sf(data= central_america,fill= "#808080")+
scale_fill_continuous_sequential(palette= "Heat 2" )+
geom_text_repel(aes(x=lon, y=lat,
label= str_wrap(name_long,20)),
color = "black",
fontface = "bold",
size = 2.8)+
coord_sf(xlim = c(-82,-35), ylim=c(-60,15))+
theme_void() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões", 10)
)

I wrote the function get_municipalites_data using the API cited above. The code is available in my gist. Note also two functions that provide the shapefiles used to draw the Brazilian and its sub-region limits: read_country and read_states. These functions are present at the {geobr} package.

I used another filter from the world dataframe. In this case, the purpose is to show the beginning of the Central American subcontinent and paint its map with a shade of gray. Here, we faced a cultural divergence as we learned in Brazil that the Americas have three sub-continents: North America, Central America, and South America. For the authors of the dataset, Central America is a sub-region of North America.

Now it’s time to finish my work. I want to show the names of the eight most populous territories on the map. Even in this final sprint, there were a few code tricks.

Most populated territories. Image by author
estados$lon<- sf::st_coordinates(sf::st_centroid(estados$geom))[,1]   
estados$lat<- sf::st_coordinates(sf::st_centroid(estados$geom))[,2]


most_populated<-
southamerica %>%
filter(iso_a2 !="BR") %>%
rename(name= name_long) %>%
as_tibble() %>%
select(name, pop, lat, lon) %>%

bind_rows(
estados %>%
rename(name= name_state) %>%
as_tibble() %>%
select(name, pop, lat, lon)
) %>%
slice_max(order_by = pop, n=8)

southamerica %>%
filter(iso_a2!="BR") %>%
bind_rows(france) %>%
left_join(data_guiana) %>%
mutate(pop=ifelse(iso_a2=="FR",obs_value,pop))%>%
mutate(lon= ifelse(iso_a2=="FR", france[[11]][[1]][[1]][[1]][1,1], lon),
lat= ifelse(iso_a2=="FR",france[[11]][[1]][[1]][[1]][1,2], lat)) %>%
ggplot() +
geom_sf(aes(fill=pop/10^6)) +
geom_sf(data=estados, aes(fill=pop/10^6)) +
geom_sf(data=brasil,fill=NA, color="#00A859", lwd=1.2)+
geom_sf(data= central_america,fill= "#808080")+
scale_fill_continuous_sequential(palette= "Heat 2" )+
geom_text_repel(data= most_populated,
aes(x=lon, y=lat,
label= str_c(str_wrap(name,10),": ",round(pop/10^6,1))),
color = "black",
fontface = "bold",
size = 2.9)+
coord_sf(xlim = c(-82,-35), ylim=c(-60,15))+
theme_void() +
theme(
panel.background = element_rect(fill="#0077be")
) +
labs(
fill= str_wrap("População em milhões", 10)
)

Three Brazilian states are among the eight most populated territories in South America. In fact, São Paulo is the second most inhabited space on the map, surpassing all the countries except Colombia.

Now, focusing on the code, you can see that I created a new dataframe to build this ranking by combining two different sf objects. I selected a sub-set of columns and changed the type from sf to tibble to enable the row binding.

So this is it. The hero has completed one possible path and left the footprints for the next journey. Now it’s your turn. Remember all of your projects that could have a significant improvement using a map representation. Using the walk-through above and gathering all the data available about population, socioeconomic issues, and so on, it’s just a matter of choosing the variable to fill the polygons.

Code and data

The complete code is available at gist.

All Brazilian datasets are characterized as public domain since they are data produced by federal government institutions, made available on the internet as active transparency, and are subject to the Brazilian Access to Information Law.

IBGE: Brazilian census Data

IPEA: Brazilian shapefiles

The French data is available from the Open Data Portal and is listed as Open License which allows explore the information for commercial purposes.

--

--

Doctor in Business Administration from UNB (2014). As data scientist, researches and implements products for transparency in the Brazilian public sector.