How to Explore US University Ranks 2017 at the Command-line? (Part I — Data Preview)
Ranking universities has become a common task performed by many institutions, each of which proposes a different ranking based on several weighted categories. Examples of such rankings include the Webometrics Ranking of World Universities, the THES-QS World University Rankings, and the Academic Ranking of World Universities. The first measures the visibility of universities and their overall performance on the web. The last two attempt to measure the performance of universities based on categories such as prizes received by members, citations, and publications. Employers, especially multinational organizations, use rankings to find universities from which to source graduates, so attending a high-ranking university can help in a competitive job market.
In this project we will use a simple, publicly available data set obtained from data.world, called: US News Universities Rankings 2017 edition. Using Bash, we will explore different features of this data and finally find an interesting fact about the correlation between tuition fees and university rank.
By completing this, you will learn to use the following Bash commands:
head — output the first part of files
tail — output the last part of files (the opposite of head)
cat — concatenate and print files
sort — sort file contents
uniq — filter out repeated adjacent lines
Before we go any further, let's set up our working environment by creating a folder on the Desktop. Assuming we have a Linux-based OS (e.g., Ubuntu) on our computer, let's first fire up a command line and navigate to our analysis folder:
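A minimal sketch of the setup commands (the folder name unirankdata comes from the text below; the exact Desktop path, ~/Desktop, is an assumption about your system):

```shell
# Create the analysis folder on the Desktop and move into it.
# -p makes mkdir succeed even if the folder already exists.
mkdir -p ~/Desktop/unirankdata
cd ~/Desktop/unirankdata

# Confirm where we are
pwd
```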
This will create a folder named unirankdata on your Desktop. Next, we download the data.
You should download the data from the web page below (we have slightly simplified it) and save it as unirank.csv.
Data set Preview
This data set is small (a toy example), and we could in principle open it in a text editor or in Excel. However, real-world data sets are often large and cumbersome to open in their entirety. Let's treat this file as if it were big (and unstructured) data and take a sneak peek at it. Previewing is often the first thing to do when you get your hands on new data: it is important to get a sense of what it contains, how it is organized, and whether the data makes sense in the first place.
To help us get a preview of the data, we can use the command `head`, which, as the name suggests, shows the first few lines of a file.
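As a quick sketch, here is head run on a tiny stand-in file (the rows are made up for illustration; the column names are borrowed from the preview later in this article):

```shell
# Build a tiny stand-in CSV so head can be tried before downloading the real file.
# These rows are invented; the real unirank.csv is larger.
printf 'Rank,Tuition and fees,Undergrad Enrollment\n' > sample.csv
printf '1,"$45,000","5,400"\n' >> sample.csv
printf '2,"$48,000","6,700"\n' >> sample.csv

# head prints the first 10 lines of a file by default
head sample.csv
```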
However, you will find the raw output is not very readable at first, so let's install a tool called csvkit, a suite of command-line tools for converting to and working with CSV (install with: sudo pip install csvkit). This will greatly help our future analyses. After we have installed csvkit, we re-run the head command, but pipe (|) its output through the csvlook command from the csvkit suite:
head unirank.csv | csvlook
You should see the first 10 lines of the file printed on the screen. To see more than the first 10 lines, e.g. the first 25, use:
head -n 25 unirank.csv | csvlook
Here, the data set name unirank.csv is a command-line argument given to the head command, and -n is an option that allows us to override the 10-line default. Such command-line options are typically specified with a dash followed by a string, a space, and the value of the option (e.g. -n 25). See the final outcome below:
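The option pattern can be sketched with the same stand-in approach (the file and its rows are made up for illustration):

```shell
# Stand-in file: one header line plus five made-up data rows
printf 'Rank,Tuition and fees,Undergrad Enrollment\n' > sample.csv
for i in 1 2 3 4 5; do
  printf '%s,"$40,000","5,000"\n' "$i" >> sample.csv
done

# -n overrides head's 10-line default: print only the first 3 lines
# (here: the header plus the first two data rows)
head -n 3 sample.csv
```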
From the first 25 lines of the file, we can infer that the data is formatted as comma-separated values. From the first line (often called a header line) and the first few rows of data, we can infer the column contents: Tuition and fees, Undergrad Enrollment, and Rank. See you in Part II 🚀 !
[This project is part of the ‘Learn to Analyze Data in Bash Shell and Linux’ course.]