Recommending GitHub Repositories with Google BigQuery and the implicit library

Juarez Bochi
Towards Data Science
5 min read · Jun 24, 2017


Keeping track of all the great repositories published on GitHub is an impossible task. The trending list does not help much. As you might have read, sorting by popularity is not as easy as it looks. And most of the repositories listed there are usually unrelated to the stack I use.

That’s a great use case for recommender systems. They help us filter information (repositories, in this case) in a personalized way. I think of them as a search query for when we don’t know what to type.

There are several approaches to recommender systems, but they are usually divided into two broad categories, collaborative and content-based filtering. Here’s the Wikipedia definition:

Collaborative filtering approaches build a model from a user’s past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.
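To make the collaborative idea concrete, here’s a toy sketch with made-up users and repos (not from the dataset): two repos are considered similar if they tend to be starred by the same users, which we can measure with cosine similarity between the columns of a user-item matrix.

```python
import numpy as np

# Toy user-item matrix: 4 users x 3 repos, 1 = starred (made up for illustration)
stars = np.array([
    [1, 1, 0],   # user A starred repos 0 and 1
    [1, 1, 0],   # user B starred repos 0 and 1
    [1, 0, 1],   # user C starred repos 0 and 2
    [0, 0, 1],   # user D starred repo 2 only
])

# Item-item cosine similarity: repos starred by the same users are "close"
norms = np.linalg.norm(stars, axis=0)
sim = (stars.T @ stars) / np.outer(norms, norms)

# Repo 0 is more similar to repo 1 (co-starred by two users)
# than to repo 2 (co-starred by only one)
print(sim[0, 1] > sim[0, 2])  # True
```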

In this particular case, it’s hard to apply a content-based approach because it’s not straightforward to measure repositories’ similarity by their content: code, documentation, tags, etc. Collaborative Filtering is more suitable and easier to apply. As I said in my previous post, even if people are not rating your content, implicit feedback can be more than enough.

In this case, we can use repository stars, the current currency of programmers’ coolness, as implicit feedback:

Stolen from a talk that Julia and I gave at Porto Alegre Machine Learning Meetup (in Portuguese)

We are going to get a random sample of the stars given in the current month from Google BigQuery, and use the amazing implicit library, which implements the brilliant paper Collaborative Filtering for Implicit Feedback Datasets. I won’t cover the algorithm itself, but you can read the paper or this blog post by Ben Frederickson, the author of implicit.
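For reference, the key trick in that paper is to turn raw implicit counts into confidence levels, e.g. c_ui = 1 + alpha * r_ui, where alpha is a tunable constant. A minimal sketch of that mapping (the count vector here is made up for illustration):

```python
import numpy as np

alpha = 40                        # tunable confidence weight
r = np.array([0, 1, 3])           # raw interaction counts per item (made up)

# confidence c_ui = 1 + alpha * r_ui: unobserved items still get a small
# confidence of 1, observed ones get weighted much higher
confidence = 1 + alpha * r
print(confidence)                 # [  1  41 121]
```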

Here is the query to get the data:

WITH stars AS (
  SELECT actor.login AS user, repo.name AS repo
  FROM githubarchive.month.201706
  WHERE type = "WatchEvent"
),
repositories_stars AS (
  SELECT repo, COUNT(*) AS c
  FROM stars
  GROUP BY repo
  ORDER BY c DESC
  LIMIT 1000
),
users_stars AS (
  SELECT user, COUNT(*) AS c
  FROM stars
  WHERE repo IN (SELECT repo FROM repositories_stars)
  GROUP BY user
  HAVING c > 10 AND c < 100
  LIMIT 10000
)
SELECT user, repo
FROM stars
WHERE repo IN (SELECT repo FROM repositories_stars)
  AND user IN (SELECT user FROM users_stars)

Notice that I’m filtering the top 1,000 repositories and getting 10,000 random users that gave between 10 and 100 stars to the top repositories. We want to sample people that are following the trending items, but we don’t want to get users that are giving stars to everything because they wouldn’t add much information.

It’s important to realize that we don’t need all the stars from all the users to generate recommendations to everybody. Adding more data would increase the recommendation quality, but it would also increase the training time. If we sample correctly, we will not hurt the model precision.

Okay, enough talking, how do we get the data and train the model?

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares

data = pd.io.gbq.read_gbq(query, dialect="standard", project_id=project_id)

# map each repo and user to a unique numeric value
data['user'] = data['user'].astype("category")
data['repo'] = data['repo'].astype("category")

# create a sparse matrix of all the users/repos
stars = coo_matrix((np.ones(data.shape[0]),
                    (data['repo'].cat.codes.copy(),
                     data['user'].cat.codes.copy())))

# train model
model = AlternatingLeastSquares(factors=50,
                                regularization=0.01,
                                dtype=np.float64,
                                iterations=50)
confidence = 40
model.fit(confidence * stars)

That’s all. Just a few lines of Python, and it’s insanely fast: the data is pulled and the model trained in less than 10 seconds. I’ve chosen parameters that usually work well, but if we were serious about it, we should do some validation. Let’s skip that and go directly to the results. What’s similar to tensorflow?

# dictionaries to translate names to ids and vice-versa
repos = dict(enumerate(data['repo'].cat.categories))
repo_ids = {r: i for i, r in repos.items()}
model.similar_items(repo_ids['tensorflow/tensorflow'])

[(u'tensorflow/tensorflow', 1.0000000000000004),
(u'jikexueyuanwiki/tensorflow-zh', 0.52015405760492706),
(u'BVLC/caffe', 0.4161581732982037),
(u'scikit-learn/scikit-learn', 0.40543551306117309),
(u'google/protobuf', 0.40160716582156247),
(u'fchollet/keras', 0.39897590674119598),
(u'shadowsocksr/shadowsocksr-csharp', 0.3798671235574328),
(u'ethereum/mist', 0.37205191726130321),
(u'pandas-dev/pandas', 0.34311692603549021),
(u'karpathy/char-rnn', 0.33868380215281335)]

Looks right! Almost everything in the list is related to machine learning and data science.
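If we did want to validate the hyperparameters, one common recipe for implicit feedback is to hold out a star per user, retrain on the rest, and check how often the held-out repo lands in the top-k recommendations. A minimal sketch of the split step (the helper and the toy matrix are mine, not part of implicit):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)

def train_test_split_stars(stars_csr, rng):
    """Hold out one starred repo per user for evaluation."""
    train = stars_csr.tolil(copy=True)
    held_out = {}
    for u in range(train.shape[0]):
        items = list(train.rows[u])
        if len(items) < 2:          # keep single-star users intact
            continue
        masked = int(rng.choice(items))
        train[u, masked] = 0        # remove the star from the training set
        held_out[u] = masked
    return train.tocsr(), held_out

# toy 3-user x 4-repo star matrix (made up for illustration)
stars = coo_matrix(np.array([[1, 1, 0, 1],
                             [0, 1, 1, 0],
                             [1, 0, 0, 0]])).tocsr()
train, held_out = train_test_split_stars(stars, rng)
# a held-out repo counts as a hit if it appears in the model's top-k
# recommendations for that user after training on `train`
```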

What about generating user recommendations? We can get recommendations for users that were in the training set directly with model.recommend, but for everyone else we’ll need to fetch their stars from the GitHub API.

Here’s the code that fetches stars from GitHub’s API and creates a new user-item matrix.

import requests

# github_auth is a (username, personal_access_token) tuple for basic auth

def user_stars(user):
    repos = []
    url = "https://api.github.com/users/{}/starred".format(user)
    while url:
        resp = requests.get(url, auth=github_auth)
        repos += [r["full_name"] for r in resp.json()]
        url = resp.links["next"]["url"] if "next" in resp.links else None
    return repos

def user_items(u_stars):
    star_ids = [repo_ids[s] for s in u_stars if s in repo_ids]
    data = [confidence for _ in star_ids]
    rows = [0 for _ in star_ids]
    shape = (1, model.item_factors.shape[0])
    return coo_matrix((data, (rows, star_ids)), shape=shape).tocsr()

Okay, which repositories should I check out?

def recommend(user_items):
    recs = model.recommend(userid=0, user_items=user_items, recalculate_user=True)
    return [(repos[r], s) for r, s in recs]

jbochi = user_items(user_stars("jbochi"))
recommend(jbochi)
[(u'ansible/ansible', 1.3480146093553365),
(u'airbnb/superset', 1.337698670756992),
(u'scrapy/scrapy', 1.2682612609169515),
(u'grpc/grpc', 1.1558718295721062),
(u'scikit-learn/scikit-learn', 1.1539551159232055),
(u'grafana/grafana', 1.1265144087278358),
(u'google/protobuf', 1.078458167396922),
(u'lodash/lodash', 1.0690341693223879),
(u'josephmisiti/awesome-machine-learning', 1.0553796439629786),
(u'd3/d3', 1.0546232373207065)]

I find the suggestions very useful. Notice that we are passing a whole new matrix of user ratings with a single user, and setting the flag recalculate_user=True.

This functionality was recently added. It can be used to generate recommendations for users that were not in the training set, or to refresh a user’s recommendations as they consume more items.
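Under the hood, recalculating a user boils down to solving the ALS normal equations for a single user’s factor vector against the fixed item factors. A rough numpy sketch of that math (the factors here are random stand-ins, and implicit’s actual implementation may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for a trained model: 1,000 items, 50 factors (random, for
# illustration; a real run would use model.item_factors)
factors, n_items, lam, conf = 50, 1000, 0.01, 40.0
Y = rng.normal(size=(n_items, factors))

def recalc_user(starred_ids, Y, lam, conf):
    # Solve the ALS normal equations for one user, holding item factors fixed:
    # x_u = (Y^T C_u Y + lam*I)^{-1} Y^T C_u p_u, with confidence c_ui = 1 + conf
    # for starred items and 1 otherwise
    YtY = Y.T @ Y
    Ys = Y[starred_ids]                              # factors of starred items
    A = YtY + conf * Ys.T @ Ys + lam * np.eye(Y.shape[1])
    b = (1.0 + conf) * Ys.sum(axis=0)
    return np.linalg.solve(A, b)

x_before = recalc_user([1, 2, 3], Y, lam, conf)
x_after = recalc_user([1, 2, 3, 7], Y, lam, conf)    # user stars one more repo
scores = Y @ x_after                                 # refreshed score per repo
```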

Another shameless plug of a functionality that I added to the library is the ability to explain recommendations:

def explain(user_items, repo):
    _, recs, _ = model.explain(userid=0, user_items=user_items, itemid=repo_ids[repo])
    return [(repos[r], s) for r, s in recs]

explain(jbochi, 'fchollet/keras')

[(u'pandas-dev/pandas', 0.18368079727509334),
(u'BVLC/caffe', 0.15726607611115795),
(u'requests/requests', 0.15263841163355341),
(u'pallets/flask', 0.15259412774463132),
(u'robbyrussell/oh-my-zsh', 0.1503775470984523),
(u'apache/spark', 0.12771260655405856),
(u'tensorflow/tensorflow', 0.12343847633950071),
(u'kripken/emscripten', 0.12294875917036562),
(u'videojs/video.js', 0.12279727716802587),
(u'rust-lang/rust', 0.10859551238691327)]

It returns the starred repositories that contributed the most to a particular recommendation. The results above mean that the model is recommending keras because I’ve starred pandas and caffe.

I hope you liked it! Here’s the notebook with all the code; you can run it with your own username.

Don’t forget to give implicit a star. Ben deserves it.
