Solving a real-world problem using data science

Naman Doshi
Towards Data Science
7 min read · Oct 18, 2018


The world of data science is evolving every day. Every professional in this field needs to keep learning constantly or risk being left behind, and you must have an appetite for solving problems. So I decided to study and solve a real-world problem that most of us have faced in our professional careers: the technical round of an interview!

How many times have you gone through a technical interview where you feel you’re acing it, and then a question comes that leaves you stumped? And from there the entire interview goes downhill because now you have lost confidence and the recruiter has lost interest.

But is it fair to judge the technical capabilities of a candidate based entirely on a 3-hour interview? It's a loss at both ends: the company has lost a potential candidate, and the candidate has lost an opportunity.

If only there were a way for the recruiter to get a sense of a candidate's technical capabilities outside the interview hall.
A scoring system of sorts: one that would give an overall score to gauge the candidate's technical knowledge, and thereby help the recruiter make an informed, unbiased decision. Sounds like a dream scenario, right?

So I decided to start a project called “Scorey” that aims to crack this challenge.

Scorey helps in scraping, aggregating, and assessing the technical ability of a candidate based on publicly available sources.

Setting the Problem Statement

The current interview scenario is biased towards “candidate’s performance during the 3-hour interview” and doesn’t take other factors into account, such as the candidate’s competitive coding abilities, contribution towards the developer community, and so on.

The Approach We’ll Take

Scorey tries to solve this problem by aggregating publicly available data from various websites, such as:

  • Github
  • StackOverflow
  • CodeChef
  • Codebuddy
  • Codeforces
  • Hackerearth
  • SPOJ
  • GitAwards

Once the data is collected, the algorithm defines a comprehensive scoring system that grades the candidate's technical capabilities based on the following factors:

  • Ranking
  • Number of Problems Solved
  • Activity
  • Reputation
  • Contribution
  • Followers

The candidate is then assigned a score out of 100. This helps the interviewer get a full view of a candidate's abilities and hence make an unbiased, informed decision.

Setting up the Project

For the entire scope of this project, we are going to use Python, a Jupyter notebook & scraping libraries. So if you’re someone who likes Notebooks, then this section is for you. If not, feel free to skip this and move on to the next section.

def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

# Set CSS properties for th elements in dataframe
th_props = [
    ('font-size', '15px'),
    ('text-align', 'center'),
    ('font-weight', 'bold'),
    ('color', '#00c936'),
    ('background-color', '#f7f7f7')
]

# Set CSS properties for td elements in dataframe
td_props = [
    ('font-size', '15px'),
    ('font-weight', 'bold')
]

# Set table styles
styles = [
    dict(selector="th", props=th_props),
    dict(selector="td", props=td_props)
]

This will make your dataframe output look neat, tidy, and really good!
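In case you're wondering how these styles actually get applied: you chain them onto a dataframe through pandas' Styler. Here's a minimal sketch, assuming your results live in a dataframe called df (a stand-in name for illustration):

import pandas as pd

# df stands in for whatever dataframe holds the scraped results
df = pd.DataFrame({'candidate': ['alice', 'bob'], 'delta': [12, -5]})

# Chain the colour function and the table styles onto the dataframe;
# in a Jupyter cell, the last expression renders as a styled HTML table
df.style \
  .applymap(color_negative_red, subset=['delta']) \
  .set_table_styles(styles)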

Now that we have a gist of what we are aiming to solve and how we are going to go about it, let’s code!

Step 1: Scraping Personal Website

We need to aggregate a person's entire “coding presence on the internet”. But where do we start? Their personal website, of course. This assumes we have access to, and permission to use, the candidate's personal website. We can parse all the necessary links from there.

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = input('Enter your personal website - ')

html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

# Collect every anchor tag on the page and print its target
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

When we run this piece of code, it prints every hyperlink found on the page.

Here we are using BeautifulSoup, a popular scraping library. With this block of code, we have direct links to a candidate's online profiles.

Now, where would you begin if you had to assess a coder at a much more granular level?

  1. Github

So first, let's use the GitHub API to get all the info we need about a particular user.

For our use case, we only need email, number of repositories, followers, hireable (true or false), current company and last recorded activity.
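Here's a minimal sketch of that call using GitHub's public REST API with the requests library. The username is just an example, and note that unauthenticated requests are rate-limited:

import requests

username = 'octocat'  # example username for illustration
user = requests.get('https://api.github.com/users/%s' % username).json()

# Keep only the fields we care about
github_info = {
    'email': user.get('email'),
    'repos': user.get('public_repos'),
    'followers': user.get('followers'),
    'hireable': user.get('hireable'),
    'company': user.get('company'),
    'last_activity': user.get('updated_at'),
}
print(github_info)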

2. StackOverflow

Ah. Devs might not believe in God, but StackOverflow is definitely a temple for them. As you may already know, reputation on StackOverflow is very hard to earn. For this step, you can use StackExchange's API, which gives you user data such as reputation, number of answers, and so on.

We’ll then add these new attributes to our existing dataframe.
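A sketch of what the StackExchange call might look like, again with requests. The user ID below is a placeholder, and the answer count uses the API's built-in total filter:

import requests

user_id = 1234567  # placeholder StackOverflow user ID
base = 'https://api.stackexchange.com/2.3'

profile = requests.get(base + '/users/%d' % user_id,
                       params={'site': 'stackoverflow'}).json()['items'][0]
answers = requests.get(base + '/users/%d/answers' % user_id,
                       params={'site': 'stackoverflow',
                               'filter': 'total'}).json()

so_info = {
    'reputation': profile['reputation'],
    'gold_badges': profile['badge_counts']['gold'],
    'answer_count': answers['total'],
}
print(so_info)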

Now we are going to target and scrape global competitive programming platforms such as CodeChef, SPOJ, Codebuddy, Hackerearth, CodeForces & GitAwards (for a deeper insight into their projects).
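As one illustration, here's roughly how a CodeChef profile could be scraped with BeautifulSoup. Fair warning: the CSS class below is an assumption about the page layout at the time of writing, and the scraper will break whenever CodeChef redesigns its profile pages:

import requests
from bs4 import BeautifulSoup

username = 'some_user'  # placeholder CodeChef handle
page = requests.get('https://www.codechef.com/users/' + username)
soup = BeautifulSoup(page.text, 'lxml')

# Assumed selector: the profile page shows the rating in a
# div with class 'rating-number'
rating_tag = soup.find('div', class_='rating-number')
print('CodeChef rating:',
      rating_tag.text.strip() if rating_tag else 'not found')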

All this scraping gives us a lot of info. The code is pretty self-explanatory, and I've documented it with comments so that it's easy to understand.

Without going into the nitty-gritty of the code, I'd like to focus on the process. But do give me a shout if you face any trouble executing it. :) Now that we have all the data in hand, let's move on to creating a scoring algorithm.

Step 2: Scoring System

The next part is to score the candidates on the following parameters:

  • Rank (25 points)
  • Number of problems solved (25 points)
  • Reputation (25 points)
  • Followers (15 points)
  • Activity (5 points)
  • Contributions (5 points)

So if you go through this piece of code, you'll understand how we can create a scoring system. Though it's pretty basic at this point, we could use machine learning to create a robust, dynamic scoring system.
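To give a flavour of the idea, here's a minimal sketch: normalize each raw metric against a cap, then weight it by the points listed above. The caps are arbitrary choices for illustration, not the exact values the project uses:

# (points, cap) per metric; a metric at or beyond its cap earns full points
WEIGHTS = {
    'rank':          (25, 10000),  # lower rank is better, so it is inverted
    'solved':        (25, 500),
    'reputation':    (25, 5000),
    'followers':     (15, 100),
    'activity':      (5, 50),
    'contributions': (5, 100),
}

def score_candidate(metrics):
    """Return a 0-100 score from a dict of raw metric values."""
    total = 0.0
    for name, (points, cap) in WEIGHTS.items():
        value = metrics.get(name, 0)
        if name == 'rank':
            fraction = max(0.0, 1 - value / cap)  # rank 1 earns full points
        else:
            fraction = min(value / cap, 1.0)
        total += points * fraction
    return round(total)

print(score_candidate({'rank': 3600, 'solved': 320, 'reputation': 2100,
                       'followers': 45, 'activity': 30, 'contributions': 12}))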

Final Score

Based on the point system we saw above, the algorithm will now assign a final score to the candidate’s technical capabilities.

So the user poke19962008 has a score of 64 out of 100! Now this will give the recruiters an idea of the technical abilities of the candidate outside the interview room.

Step 3: Predictive Modeling

When you are trying to solve a real-world problem and productize the solution, it's important to consider the requirements of the end user.
In this case, that's the recruiter.
How can we use the power of machine learning to add value for the recruiter?

Upon brainstorming, I found the following use cases:
1. A model that predicts whether or not management will be satisfied with the candidate's skill set
2. A model that predicts the probability of a candidate churning after being hired
3. A genetic algorithm that assigns the candidate to the most suitable team

Let's try to code the first use case: predicting the company's satisfaction.
Assume the recruiter has been using Scorey to screen candidates for some time and now has a database of 100 candidates.
Post recruitment, based on each candidate's performance, the recruiter updates the database with a new binary attribute, “Satisfaction”, valued either 0 or 1.
Let's create a dummy database for now and build a predictive model using scikit-learn, Pandas, and NumPy.

  1. Import data & libraries
  2. Clean the data — remove duplicates and null values
  3. Using label encoder to deal with categorical data
  4. Split the dataset into train & test
  5. Using kNN classifier to predict
  6. Check Accuracy
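Here's a compact sketch of those six steps on a dummy database. The column names and synthetic values are placeholders for illustration:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Import data: a dummy database of 100 past candidates
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'score': rng.integers(20, 100, 100),
    'reputation': rng.integers(0, 5000, 100),
    'open_source': rng.choice(['yes', 'no'], 100),
    'satisfaction': rng.integers(0, 2, 100),
})

# 2. Clean the data
df = df.drop_duplicates().dropna()

# 3. Encode the categorical column
df['open_source'] = LabelEncoder().fit_transform(df['open_source'])

# 4. Split into train and test sets
X, y = df.drop(columns='satisfaction'), df['satisfaction']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 5. Fit a kNN classifier
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# 6. Check accuracy on the held-out set
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))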

Using these steps, you will get a simple model that can predict whether a candidate will fit into the company based on underlying trends.
For example: candidates who have a higher reputation and contribute to open source are more likely to stay for a longer period of time.

Step 4: Dashboarding

I went ahead and made a dashboard. It's still a work in progress, and I'll be happy to share some screenshots of the interface.

There you have it. Your very own end-to-end product.
To summarize:
1. We identified a problem
2. Thought methodically about how to solve it
3. Used web scraping to gather data
4. Built an algorithmic scoring system
5. Used machine learning to build a predictive model
6. Built a dashboard to communicate results

The tech stack we used: Python, with BeautifulSoup, urllib, Pandas, and scikit-learn.

So that’s all for this article. We took a real life problem and tried to use data and algorithms to solve it!

Next time you go for an interview, you can pitch this system to the recruiter. :)

Code for the entire project can be found on Github — here.

So what’s next?

  • Integrating machine learning components for rule generation
  • Handling missing data exceptions dynamically

If you think this project is cool and would like to contribute, you are more than welcome! Let’s build something exciting for the community.

You can connect with me over LinkedIn or on Twitter to get daily updates on what’s new in data science & machine learning.
