Preparing for Graduate School Interviews with Network Analysis

Leverage data science to uncover connections between faculty across institutions.

Danilo Trinidad Pérez-Rivera
Towards Data Science

--

Figure 1. A graph of faculty members at different institutions (color) connected by their publication history.

This article is the first part of an ongoing series about leveraging Data Science throughout the Graduate School Interview process. Check out the next installment on Topic Modelling, and be on the lookout for more.

Graduate School interview season is firing up in the United States, and that means you’ll meet tons of important people in a niche field who may know each other. As you prepare, you will start to realize just how small the world is, and how faculty across the country, and across the world, have built their careers in collaboration and share a mutual respect for each other. To help understand these relationships, I have spent the past few weeks thinking about how best to collect, process, and represent this data for effective use during my interviews. Having just gotten back from my first set of interviews and shared some of my results with my peers, their interest motivated me to write this short walk-through, so anyone interested can do the same for themselves.

We will mainly be working in Python, with the following modules:

  • Scholarly, for data collection through Google Scholar.
  • Pandas and Numpy, for basic data wrangling and operations.
  • Networkx and Matplotlib, for plotting and data representation.
  • Python-docx, for automated document construction.

Some of these are not included in a typical Python installation, so I would refer you to their respective websites/documentation for installation instructions.
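If you want a quick sanity check that everything is importable before you start, something along these lines will do (a minimal sketch; note that the PyPI packages python-docx and stop-words import as docx and stop_words):

# Quick check that the required modules are importable.
for module in ["scholarly", "pandas", "numpy", "networkx", "matplotlib", "docx"]:
    try:
        __import__(module)
        print(module + " is available")
    except ImportError:
        print(module + " is missing; install it before continuing")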

Section 0: Prepping for Data Collection

To my knowledge, there isn’t an official Google Scholar API that can be used to obtain the data directly, which is why Scholarly is used. This brings some advantages and disadvantages, among them occasionally being blocked due to high-volume usage. For this reason, we want to make some small changes to its script after installation. Locate the scholarly.py file in your lib/site-packages/ folder under Python, or wherever you have decided to keep this module. Once there, we will add a simple random user-agent generator as implemented here. Within scholarly.py you can find the definition of the web scraper’s headers as:

_HEADERS = {
    'accept-language': 'en-US,en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml'
}

This is good enough to function, but depending on the number of faculty you are interested in, it could be insufficient. Therefore, at the beginning of the script we will add the random user-agent generator from the previously hyperlinked article, and the beginning of our scholarly.py script should now look like this:

import numpy as np  # add this import if scholarly.py does not already have it

def get_random_ua():
    random_ua = ''
    ua_file = 'C:/Users/Masters-PC/Documents/ua_file.txt'  # point this at your own list of user-agent strings
    try:
        with open(ua_file) as f:
            lines = f.readlines()
        if len(lines) > 0:
            prng = np.random.RandomState()
            index = prng.permutation(len(lines) - 1)
            idx = np.asarray(index, dtype=np.integer)[0]
            random_ua = lines[int(idx)].strip()  # pick one user-agent string at random
    except Exception as ex:
        print('Exception in random_ua')
        print(str(ex))
    finally:
        return random_ua

_GOOGLEID = hashlib.md5(str(random.random()).encode('utf-8')).hexdigest()[:16]
_COOKIES = {'GSP': 'ID={0}:CF=4'.format(_GOOGLEID)}
_HEADERS = {
    'accept-language': 'en-US,en',
    'User-Agent': get_random_ua(),
    'accept': 'text/html,application/xhtml+xml,application/xml'
}

This should get us ready for our application, though you’re free to look through the rest of the hyperlinked article and implement the rest of their suggestions.
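For reference, the ua_file.txt that the snippet reads is just a plain text file with one user-agent string per line. A hypothetical example of its contents (use whatever strings you prefer) might be:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0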

Section 1: Data Collection

Depending on how close you are to your interview date, you may or may not have a list of faculty you will meet with. This is where we will obtain the breadth of their data using Google Scholar. First, import the corresponding modules and define this author-list.

authorlist = """Wilhelm Ostwald
Linus Pauling
Max Perutz
Frederick Sanger
Michael Smith
Harold Urey
Milton Friedman
Friedrich Hayek"""

I’ve only included eight random Nobel laureates, but the idea is that you should be able to include all of the faculty of interest to you at once. My list included upwards of 40, spread across 10+ institutions, and I also included my previous mentors in case there were any previously undiscovered connections there. With this, we are ready to use scholarly and write the output to a .csv file that we can check for errors before continuing.

import scholarly
import csv

authors = authorlist.split('\n')
table = [['Name','ID','Affiliation','Domain','First Year Cited','Citations','Citations (5 Years)','cites_per_year','coauthors','hindex','hindex5y','i10index','i10index5y','interests','publications']]
for i in range(len(authors)):
    print(i)
    query = authors[i]
    try:
        search_query = scholarly.search_author(query)
        author = next(search_query).fill()
        data = [author.name,
                author.id,
                author.affiliation,
                author.email,          # verified email domain shown on the profile
                'first cited year',    # placeholder column
                author.citedby,
                author.citedby5y,
                author.cites_per_year,
                ['coauthors'],         # placeholder, filled below
                author.hindex,
                author.hindex5y,
                author.i10index,
                author.i10index5y,
                author.interests,
                ['publications']]      # placeholder, filled below
        coauthors = []
        for j in author.coauthors:
            coauthors += [[j.name, j.affiliation]]
        data[8] = coauthors
        publications = []
        for j in author.publications:
            publications += [j.bib['title']]
        data[-1] = publications
    except:
        print(query + ' is not available')
        data = ['']*15
        data[0] = query
    table += [data]
with open('FoIScholardata.csv', 'w', newline='', encoding='utf-16') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for i in table:
        wr.writerow(i)

Once this runs, you should have a FoIScholardata.csv file in your current working directory that can be opened in Microsoft Excel, Google Sheets, or any other spreadsheet editor to check the data.
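Before moving on, a quick way to spot which authors came back empty is to load the file with pandas and list the rows with a blank ID (a minimal sketch, assuming the column names and utf-16 encoding used above):

import pandas as pd

# Load the scraped data (written with utf-16 encoding above)
table = pd.read_csv("FoIScholardata.csv", encoding='utf-16')

# Rows with a missing ID are authors whose profile could not be retrieved
missing = table[table.ID.isna()]
print(missing.Name.tolist())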

Section 1a: Proofread, clean, and augment the data.

The first thing you should do is double-check that you obtained the RIGHT professor’s data, as those with shared or similar names may be confounded. If that is the case, you may want to include their institution’s name/abbreviation in your search query to disambiguate them, such as:

authorlist = "...
Linus Pauling CalTech
..."

The next problem you may notice is that some professors’ data was not available. If this is not the case for you, you may continue to Section 1b. The most likely cause is that they have not created a Google Scholar profile. To get around this, we can augment our data from other sources, including their institutional webpages or sites such as ResearchGate. Feel free to create new columns of data if this interests you. Once you have done this, if you still need to obtain a list of their publications, you can do so with scholarly using the following code snippet. Replace the ‘users’ list with the names as they are abbreviated in the authors’ publications on Google Scholar.

import scholarly
import pandas as pd
import time
import random

table = pd.read_csv("FoIScholardata.csv", encoding='utf-16')
users = ['W Ostwald', 'MF Perutz']
names = list(table.Name)
surnames = []
for i in names:
    surnames += [i.split(' ')[-1]]
for user in users:
    print(user)
    try:
        srch = scholarly.search_pubs_query(user)
    except:
        print('Holding for an hour.')
        time.sleep(3600)
        srch = scholarly.search_pubs_query(user)  # retry once after waiting
    pubs = []
    noerror = True
    while noerror:
        time.sleep(0.5 + random.random())
        try:
            pub = next(srch)
            if user in pub.bib['author']:
                pubs += [pub.bib['title']]  # store the title directly
                print(len(pubs))
        except:
            noerror = False

    # write the collected titles into the matching author's row
    n = surnames.index(user.split(' ')[-1])
    table.at[n, 'publications'] = pubs
    table.to_csv(r'fillins.csv', index=None, header=True)
    print('Holding for about a minute')
    time.sleep(45 + random.random()*30)

Modify the exported file name as you see fit and add the results back to the original FoIScholardata.csv file. The sleep times are included to help avoid being blocked, but feel free to remove or modify them.
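If you prefer to do that merge programmatically rather than by hand, here is one possible sketch (purely illustrative; it assumes both files share the Name column, and you should adjust the encoding arguments to match however you saved each file):

import pandas as pd

# Hypothetical merge of the filled-in publication lists back into the main file
main = pd.read_csv("FoIScholardata.csv", encoding='utf-16')
fills = pd.read_csv("fillins.csv")

for i in range(len(fills)):
    # Only copy rows that actually have a publications entry
    if isinstance(fills.publications[i], str) and fills.publications[i] != '[]':
        n = list(main.Name).index(fills.Name[i])
        main.at[n, 'publications'] = fills.publications[i]

main.to_csv('FoIScholardata.csv', index=None, header=True, encoding='utf-16')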

Section 1b: Determine the collaboration relationships between authors.

Google Scholar does include a list of coauthors, as approved by the profile holder, but it seems to be a relatively new feature. As such, it may have been introduced after a user last updated their profile, or before their collaborators created theirs, and so it may not include all actual coauthors. Therefore, we will add these collaborations manually, based on publications that share the same title.

import pandas as pd
import numpy as np

# add encoding='utf-16' here if you kept the original encoding from Section 1
table = pd.read_csv("FoIScholardata.csv")
matrix = np.zeros((len(table), len(table)), dtype=int)
newcol = []
for i in range(len(table)):
    print(i)
    author = table.Name[i]
    pubs = []
    collaborators = []
    if type(table.publications[i]) == str:
        pubs = eval(table.publications[i])
        for j in range(len(table)):
            if i == j:
                continue
            if type(table.publications[j]) == str:
                # publication titles shared between author i and author j
                coms = list(set(eval(table.publications[j])) & set(pubs))
                matrix[i][j] = len(coms)
                if matrix[i][j] > 0:
                    collaborators += [table.Name[j]]
    newcol += [collaborators]
table['collaborators'] = newcol
table.to_csv(r'FoIScholardata.csv', index=None, header=True)
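If you want to sanity-check the result before plotting, the shared-title counts should be symmetric. A small sketch, run right after the block above so that matrix and table are still in memory, could be:

# Shared-title counts should be symmetric; also list the strongest collaborations found
print('Symmetric:', np.array_equal(matrix, matrix.T))
for i, j in zip(*np.nonzero(np.triu(matrix))):
    print(table.Name[i], '<->', table.Name[j], ':', matrix[i][j], 'shared titles')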

Section 2: Generate main collaborator graphs.

Now for our main event, though we definitely have much to do after this, and I hope you stick around. With the data duly organized, we will now use Networkx to construct and display our graph.

import math
import pandas as pd
import numpy as np
import networkx as nx
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

table = pd.read_csv("FoIScholardata.csv")
G = nx.Graph()
for i in range(len(table)):
    G.add_node(table.Name[i])
for i in range(len(table)):
    for j in eval(table.collaborators[i]):
        G.add_edge(table.Name[i], j)
domains = np.unique(table.Domain)
# see https://matplotlib.org/tutorials/colors/colormaps.html
viridis = cm.get_cmap('hsv', len(domains))
figure(num=None, figsize=(16, 12), dpi=100, facecolor='w', edgecolor='k')
pos = nx.spring_layout(G, k=0.25)  # positions for all nodes
labels = {i: table.Name[i] for i in range(0, len(table.Name))}
options = {"alpha": 0.8}
for i in range(len(G.nodes())):
    nx.draw_networkx_nodes(G, pos, nodelist=[table.Name[i]],
                           node_size=table.hindex[i]*19,
                           node_color=viridis(list(domains).index(table.Domain[i])),
                           **options)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5)  # draw the collaboration edges
for i in pos:
    angle = 0
    plt.text(pos[i][0], pos[i][1], s=i, fontsize=12, rotation=angle, rotation_mode='anchor')
plt.axis('off')
plt.savefig("CollaboratorSpringGraphColor.png")  # save as png
plt.close()

With some luck, your result should look a little like this:

Figure 2. Spring (Force-directed) graph of co-publishing scientists. Note: The connections between these scientists are not factually correct.

If you run into problems with the color formatting, you may want to manually modify the ‘Domain’ variable in your data. In this format, you can easily identify the sub-networks and other patterns that may emerge in the graph. Other layouts are available, notably the one in our titular image, the circular graph. For this, we slightly modify our code in three parts:

  1. Modify the position declaration:
pos = nx.circular_layout(G)

2. Rotate the name labels so that they are non-overlapping:

for j in range(len(pos)):
    i = list(pos.keys())[j]
    angle = 0
    if ((math.atan2(pos[i][1], pos[i][0])*180/math.pi) > (70)) & ((math.atan2(pos[i][1], pos[i][0])*180/math.pi) < (110)):
        angle = 45
    elif ((math.atan2(pos[i][1], pos[i][0])*180/math.pi) > (-110)) & ((math.atan2(pos[i][1], pos[i][0])*180/math.pi) < (-70)):
        angle = -45
    plt.text(pos[i][0], pos[i][1], s=i, fontsize=12, rotation=angle, rotation_mode='anchor')

3. Change the output file name:

plt.savefig("CollaboratorCircleGraphColor.png")

The resultant image should look like this:

Figure 3. Labeled circular graph demonstrating interconnections between selected faculty. Note: The connections between these scientists are not factually correct.

Section 3: Derive additional individualized data per researcher.

Now, that graph is good for the big picture. However, you might want to break it down significantly for the purpose of thinking about each professor as your interview date approaches. First, we will build on Google Scholar’s notion of “interests”. As we have come to expect, this feature may or may not be filled in by the profile holder. However, we can build a substitute by looking at the publication titles once again. What we will do is extract word frequencies across all of the publication titles and see what rises to the top. If you don’t have it already, you may also need to install the stop-words module, which provides the get_stop_words function used below.

import pandas as pd
import matplotlib.pyplot as plt
from stop_words import get_stop_words

excs = get_stop_words('en')
table = pd.read_csv("FoIScholardata.csv")

def valuef(x):
    return worddict[x]

newcol = []
for i in range(len(table)):
    pubs = eval(table.publications[i])
    # count word occurrences across this author's publication titles
    worddict = {}
    for j in pubs:
        words = j.lower().split(' ')
        for k in words:
            if k in worddict.keys():
                worddict[k] += 1
            else:
                worddict[k] = 1
    values = []
    topics = []
    for j in worddict:
        if worddict[j] > 5:
            if not (j in excs):
                values += [worddict[j]]
                topics += [j]
    width = 0.3
    topics.sort(key=valuef)
    values.sort()
    newcol += [topics]
    plt.bar(topics[-15:], values[-15:], width, color='b')
    plt.xticks(rotation=-45)
    plt.gcf().subplots_adjust(bottom=0.22)
    plt.title('Title Word Frequency for ' + str(table.Name[i]))
    plt.savefig(str(table.Name[i]) + "_WF.png")  # save as png
    plt.close()
table['topics'] = newcol
table.to_csv(r'FoIScholardata.csv', index=None, header=True)

As well as giving us a new column in our data spreadsheet, this will also produce a visualization for each professor as follows:

Figure 4. Word frequency histogram for titles of publications by Dr. Charles Geier of Pennsylvania State University.

This graph is immediately informative, as we can identify the researcher’s interest in inhibitory control and cognitive development in general. A deeper analysis, including topic modeling, could derive deeper connections between researchers, though this is beyond the scope of the current article.

Update: I’ve decided to follow through on this promise and have since applied a topic modelling approach along these lines. Feel free to check it out!

Section 4: Produce closer looks at graphs per researcher.

The large graph we created may not be fit for an individualized report on each researcher. Therefore, we may want to extract just the sub-networks, similar to what was visualized in the force-directed graph. For this, we will look at their first-order collaborators (direct) and second-order collaborators (collaborators of collaborators).

import math
import pandas as pd
import numpy as np
import networkx as nx
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

table = pd.read_csv("FoIScholardata.csv")

def f(n, c2, c1):
    m = (c2-c1)/4
    b = c2 - m*5
    return m*n+b

for qt in range(len(table.Name)):
    G = nx.Graph()
    # add the researcher of interest, then their first- and second-order collaborators
    G.add_node(table.Name[qt])
    for j in eval(table.collaborators[qt]):
        G.add_node(j)
        G.add_edge(table.Name[qt], j)
        if len(eval(table.collaborators[list(table.Name).index(j)])) > 0:
            for k in eval(table.collaborators[list(table.Name).index(j)]):
                G.add_node(k)
                G.add_edge(j, k)
    N = len(eval(table.collaborators[qt]))
    domains = np.unique(table.Domain)
    # see https://matplotlib.org/tutorials/colors/colormaps.html
    viridis = cm.get_cmap('hsv', len(domains))
    figure(num=None, figsize=(16, f(N, 12, 4)), dpi=100, facecolor='w', edgecolor='k')
    #pos = nx.spring_layout(G, 1/(len(G.nodes)))  # positions for all nodes
    #pos = nx.circular_layout(G)                  # positions for all nodes
    pos = nx.kamada_kawai_layout(G)               # positions for all nodes

    labels = {i: list(G.nodes())[i] for i in range(0, len(G.nodes()))}
    options = {"alpha": 0.8}
    for i in range(len(G.nodes())):
        nx.draw_networkx_nodes(G, pos, nodelist=[list(G.nodes())[i]],
                               node_size=table.hindex[list(table.Name).index(list(G.nodes())[i])]*19,
                               node_color=viridis(list(domains).index(table.Domain[list(table.Name).index(list(G.nodes())[i])])),
                               **options)
    nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5)
    for i in pos:
        surname = i.split(' ')[-1]
        plt.text(pos[i][0], pos[i][1], s=i, fontsize=f(N, 36, 48))
    plt.axis('off')
    xlims1 = plt.xlim()
    plt.xlim(xlims1[0], f(N, 2, 2.5))
    #plt.ylim(-3, 3)
    plt.savefig(table.Name[qt] + "CollaboratorSpringGraphColor.png")  # save as png
    plt.close()

It’s worth discussing that self-defined ‘f’ function. Not all researchers have the same number of collaborators, so I use ‘f’ as a simple linear calibration: it returns c1 when there is one collaborator and c2 when there are five, scaling the figure height, font size, and margins so that each image is aesthetically pleasing when we produce the reports in the next section. The numbers are somewhat arbitrary, so feel free to tweak them to your liking.
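To make that scaling concrete, here is a small illustrative sketch using the same ‘f’ defined above, printed for a few collaborator counts:

def f(n, c2, c1):
    m = (c2 - c1) / 4
    b = c2 - m * 5
    return m * n + b

# f is a linear ramp: it returns c1 at n = 1 and c2 at n = 5, so the figure grows
# taller and the labels shrink as a researcher's collaborator count grows.
for n in [1, 3, 5, 8]:
    print(n, 'collaborators -> figure height', f(n, 12, 4), ', label font size', f(n, 36, 48))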

Section 5: Generate individualized reports per researcher.

With all of this data visualized and ready to go, we can now proceed to finalize our summarized reports. For this, we will put Python-docx to use.

from docx import Document
from docx.shared import Inches
from docx.enum.table import WD_ALIGN_VERTICAL
import pandas as pd
import numpy as np
import networkx as nx
from matplotlib import cm
import matplotlib.pyplot as plt

def f(n, c2, c1):
    m = (c2-c1)/4
    b = c2 - m*5
    return m*n+b

table = pd.read_csv("FoIScholardata.csv", encoding='utf-16')
for qt in range(len(table.Name)):
    print(qt)
    document = Document()
    try:
        if type(table.Affiliation[qt]) == float:
            continue
        document.add_heading(table.Name[qt], 0)

        p = document.add_paragraph('')
        p.add_run(table.Affiliation[qt]).italic = True
        p.add_run('\n' + 'Interests ' + str(table.interests[qt])[1:-1])
        document.add_picture(table.Name[qt] + "_WF.png", width=Inches(5))
        if len(eval(table.collaborators[qt])) > 0:
            N = len(eval(table.collaborators[qt]))  # used to scale the graph image below
            document.add_heading('Collaborators', level=1)
            coltable = document.add_table(rows=1, cols=2)
            coltable.cell(0, 0).vertical_alignment = WD_ALIGN_VERTICAL.CENTER
            for j in eval(table.collaborators[qt]):
                pol = coltable.rows[0].cells[0].add_paragraph(
                    j, style='List Bullet')

            co = coltable.rows[0].cells[1].add_paragraph()
            bo = co.add_run()
            bo.add_picture(table.Name[qt] + "CollaboratorSpringGraphColor.png", width=Inches(f(N, 4.9, 4)))
            cols = []
            ncols = []
            bestcol = []
            for j in eval(table.collaborators[qt]):
                k = list(table.Name).index(j)
                collabs = list(set(eval(table.publications[qt])) & set(eval(table.publications[k])))
                cols += [table.Name[k]]
                ncols += [len(collabs)]
                bestcol += [collabs[0]]

            ntable = document.add_table(rows=1, cols=3)
            hdr_cells = ntable.rows[0].cells
            hdr_cells[0].text = 'Collaborator'
            hdr_cells[1].text = 'Title'
            hdr_cells[2].text = 'N'
            for k in range(len(cols)):
                row_cells = ntable.add_row().cells
                row_cells[0].text = cols[k]
                row_cells[0].width = Inches(1)
                row_cells[1].text = bestcol[k]
                row_cells[1].width = Inches(5)
                row_cells[2].text = str(ncols[k])
                row_cells[2].width = Inches(0.5)

        margin = Inches(0.5)
        sections = document.sections
        for section in sections:
            section.top_margin = margin
            section.bottom_margin = margin
            section.left_margin = margin
            section.right_margin = margin

        document.add_page_break()

        # make sure the ScholarSheets/ folder exists before running
        document.save('ScholarSheets/' + table.Name[qt] + '.docx')
    except:
        print(str(qt) + ' not')

This will leave us with a final Word Document that could look a little like this:

Figure 5. Example output document for Dr. Charles Geier.

With this document in hand, we now have a succinct summary of data on each professor, and can begin to take notes and run a more directed search of the literature to review during the interview process. We can kill two birds with one stone by reading a paper shared between professors, formulating potential questions, and jotting down notes. If you made it this far, thanks for your support, and I wish you the best this season! If you would like to dig even deeper, don’t forget to check out part 2 on Topic Modelling.
