How to Explore US University Ranks 2017 at the Command-line? (Part I — Data Preview)


Ranking universities has become a common task performed by many institutions, each of which proposes a different ranking based on several weighted categories. Examples of such rankings are the Webometrics Ranking of World Universities, the THE–QS World University Rankings, and the Academic Ranking of World Universities. The first measures the visibility of universities and their overall performance on the web. The last two attempt to measure the performance of universities based on categories such as prizes received by members, citations, and publications. Employers, especially multinational organizations, use rankings to find universities from which to source graduates, so attending a high-ranking university can help in a competitive job market.

In this project we will use a small, publicly available data set from data.world called US News Universities Rankings, 2017 edition. Using Bash, we will explore its different features and finally uncover an interesting fact about the correlation between tuition fees and university rank.

Learning objectives
By completing this project, you will learn to use the following Bash commands:

head — output the first part of files
tail — output the last part of files (the opposite of head)
cat — concatenate and print files
sort — sort file contents
uniq — report or omit repeated adjacent lines
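As a quick taste of the last two commands, here is a minimal sketch on a made-up list of state codes (the file and its contents are illustrative, not part of the ranking data). Note that uniq only removes adjacent duplicates, which is why we sort first:

```shell
# Make a tiny sample file of state codes (made-up data, for illustration only)
printf 'CA\nNY\nCA\nTX\nNY\n' > states.txt

# uniq only drops *adjacent* duplicates, so sort the lines first
sort states.txt | uniq
# prints: CA, NY, TX (one per line)
```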

Before we go any further, let's set up our working environment by creating a folder on the Desktop. Assuming we have a Linux-based OS (e.g., Ubuntu), let's first fire up a command line and create our analysis folder:

cd ~/Desktop
mkdir unirankingdata
cd unirankingdata

This will create a folder named unirankingdata on your Desktop and move into it. Next, we download the data.

You should download the data from the link below (we have slightly simplified it); wget saves it as unirank.csv:

wget https://www.scientificprogramming.io/datasets/unirank.csv

Data set Preview
This data set is a small (toy) one, and we could in principle open it in a text editor or in Excel. However, real-world data sets are often large and cumbersome to open in their entirety. Let's therefore treat it as if it were big (and unstructured) data and take a sneak peek at it. Previewing is often the first thing to do when you get your hands on new data: it is important to get a sense of what it contains, how it is organized, and whether the data makes sense in the first place.

To help us preview the data, we can use the command `head`, which, as the name suggests, shows the first few lines of a file.

head unirank.csv

However, you will find this output not very readable at first, so we install a tool called csvkit, a suite of command-line tools for converting to and working with CSV (install it with sudo pip install csvkit).

This will greatly help our later analyses. After installing csvkit, we re-run the head command, but pipe its output, using |, through the csvlook command from the csvkit suite:

head unirank.csv | csvlook

You should see the first 10 lines of the file printed on the screen. To see more than the first 10 lines, e.g., the first 25, use the -n option:

head -n 25 unirank.csv | csvlook
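The tail command from the learning objectives is the mirror image of head, showing the last lines of a file instead of the first. Here is a minimal sketch on a throwaway numbered file (not the ranking data); the -n +K form starts output at line K, which is handy for skipping a CSV header line:

```shell
# Build a throwaway 10-line file (illustrative only)
seq 1 10 > sample.txt

# Show the last 3 lines
tail -n 3 sample.txt                # prints 8, 9, 10

# Start from line 2 onward, i.e., skip the first line (like skipping a header)
tail -n +2 sample.txt | head -n 3   # prints 2, 3, 4
```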

Here, the file name unirank.csv is a command-line argument given to the command head, and -n is an option that allows us to override the 10-line default. Such command-line options are typically specified with a dash followed by a letter, a space, and the value of the option (e.g., -n 25). See the final outcome below:

The unirank.csv data set preview (head -n 25 unirank.csv)

From the first 25 lines of the file, we can infer that the data is formatted as comma-separated values. From the first line (often called the header line) and the first few rows of data, we can infer the column contents: Name, City, State, Tuition and fees, Undergrad Enrollment, and Rank. See you in Part II 🚀!
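To double-check the column names without eyeballing the table, we can split the header line on commas so each column name lands on its own line. The sketch below runs on a tiny stand-in file with the same header (its single data row is hypothetical, invented for illustration):

```shell
# A tiny stand-in file with the same header (the data row is hypothetical)
printf '%s\n' \
  'Name,City,State,Tuition and fees,Undergrad Enrollment,Rank' \
  'Example University,Springfield,IL,45000,8000,42' > mini.csv

# Split the header line on commas to list one column name per line
head -n 1 mini.csv | tr ',' '\n'
```

The same one-liner works unchanged on unirank.csv itself.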

[This project is part of the ‘Learn to Analyze Data in Bash Shell and Linux’ course.]
