Google Hash Code 2020: a greedy approach

How we obtained ranking 16/204 in Belgium and 531/10724 globally

Gilles Vandewiele
Towards Data Science


Google Hash Code

Every year, Google organizes a programming competition called Hash Code. The goal is to solve an optimization problem in 4 hours in a team of 2 to 4 people. The competition attracts tens of thousands of competitors from all over the world. This was the first year I heard of it, so I decided to participate along with Bram Steenwinckel and Pieter De Cremer.

Scanning books in multiple libraries

The goal this year was to obtain the highest score by scanning books that were distributed over different libraries within a limited number of days. Copies of the same book could be available in more than one library. Each book was assigned a score, and points could only be obtained once per book (i.e. scanning two copies of the same book only awards points once). Moreover, each library had a certain “signup” time (a start-up/initiation period) and its own throughput (the number of books it can scan per day). Only one library can be performing the signup process at any given time, while the scanning of books can happen in parallel. An example (and a non-optimal solution):

A simple example of the optimization problem. Books 0, 1, 2, 3, 4 and 5 were assigned scores of 1, 2, 3, 6, 5 and 4 respectively. The total number of points is points(day_0) + … + points(day_6). Days 0–2 award no points, since no books are scanned yet; then 4 points on day 3, 3 points on day 4 and 9 points on day 5. No points are awarded on day 6, as books 2 and 3 have already been scanned, resulting in a total score of 16. This is of course not a great solution: book 4 is never scanned, and library 0 has a shorter signup time and a higher throughput.

In total, we were given 6 different inputs/problems with varying characteristics (many libraries, many books, identical scores for all books, no duplicate copies of books, etc.).

First approach

Our first approach was to assign a score to each of the libraries, which consisted of the sum of scores of the books it could scan before the deadline:

Calculate the scores for all libraries in a single pass by taking the sum of highest book scores that can be scanned in time
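The original gist is not embedded here; a minimal sketch of that single-pass scoring, assuming each library is a dict with a signup time, a books-per-day throughput, and a list of book scores (all names hypothetical), might look like:

```python
def library_score(library, num_days):
    """Sum of the highest book scores this library could scan before the deadline."""
    days_left = num_days - library["signup"]            # scanning days after signup
    capacity = max(0, days_left) * library["throughput"]  # books scannable in that time
    # Scan the highest-scoring books first.
    best_books = sorted(library["book_scores"], reverse=True)
    return sum(best_books[:capacity])

# Hypothetical toy library: signup takes 2 days, 1 book/day, 7-day deadline.
lib = {"signup": 2, "throughput": 1, "book_scores": [6, 3, 5, 1]}
print(library_score(lib, num_days=7))  # 5 days of scanning, only 4 books -> 15
```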

We then started the signup process of the libraries in order of their scores (descending). This resulted in a score of roughly 15.1 million points.

Re-calculating the scores

The first approach was very fast, as it only performed a single pass over the libraries. Unfortunately, it did not score that well, as it failed to take important factors into account. One such factor is that only one library can be performing its signup process at any given time: a new library can only start its signup after the previous one has finished. We updated our code to pick the best library and then keep track of the total signup time that had already passed:

We now keep track of a time_total variable
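A sketch of this greedy loop under the same assumed data layout as before (signup time, throughput and a list of book scores per library; all names hypothetical):

```python
def choose_order(libraries, num_days):
    """Greedily sign up libraries one at a time, tracking elapsed signup time."""
    order, time_total = [], 0
    remaining = list(libraries)
    while remaining:
        def score(lib):
            # Days this library can scan, given the signups already scheduled.
            days_left = num_days - time_total - lib["signup"]
            capacity = max(0, days_left) * lib["throughput"]
            return sum(sorted(lib["book_scores"], reverse=True)[:capacity])
        best = max(remaining, key=score)
        if score(best) == 0:           # nothing left that can still earn points
            break
        remaining.remove(best)
        time_total += best["signup"]   # next signup starts after this one ends
        order.append(best)
    return order

libs = [
    {"signup": 2, "throughput": 1, "book_scores": [1, 2, 3]},
    {"signup": 1, "throughput": 2, "book_scores": [5, 6]},
]
print([lib["signup"] for lib in choose_order(libs, num_days=5)])  # [1, 2]
```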

This update achieved a score of 15.5 million. Our code was now a lot slower though, as it was quadratic in the number of libraries (assign scores to each of the libraries → pick the best one → update the counter → repeat). As a result, we were no longer able to solve the fourth input problem, where the number of libraries was too high.

Removing scanned books from the pool

Another important factor is that scanning multiple copies of the same book awards no extra points. As our code already kept track of the total time passed, we simply had to use a set to store the books that had already been assigned.

Our updated formula. books and assigned books are sets, \ is set difference

Taking signup time into account

The final adjustment we managed to make in time was updating the formula to compute the average score per signup day, rather than just the raw sum. This was very important for problems where the number of books in each library was very low: those libraries had a lot of idle time once they were done scanning all their books. In earlier iterations, our code would pick the library with the highest-scoring books while ignoring how long its signup took. The final formula and the code for the final version can be found below:

Dividing by the signup time results in a significant gain
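Concretely, only the last line of the scoring function changes (same hypothetical data layout as in the earlier sketches):

```python
def library_score(lib, scores, num_days, time_total, assigned):
    """Average points per signup day, instead of the raw sum."""
    days_left = num_days - time_total - lib["signup"]
    capacity = max(0, days_left) * lib["throughput"]
    available = lib["books"] - assigned
    best = sorted((scores[b] for b in available), reverse=True)
    return sum(best[:capacity]) / lib["signup"]   # penalize slow signups

# Two libraries with equally valuable books, but very different signup times.
scores = {0: 5, 1: 5, 2: 5, 3: 5}
fast = {"signup": 1, "throughput": 1, "books": {0, 1}}
slow = {"signup": 4, "throughput": 1, "books": {2, 3}}
args = (scores, 10, 0, set())
print(library_score(fast, *args), library_score(slow, *args))  # 10.0 2.5
```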

This update was by far the most significant one, as it boosted our score to 26.6 million.

Simplified faster version for Problem D

After making all these updates, we only had about 20 minutes remaining until the end. We decided to focus on problem D, since it had only been solved by the first version of our code (which wasn’t that good). By inspecting the input, we saw that the problem had some very unique properties: all libraries had the same throughput and signup time, and the score for each of the books was the same. The only difference was the number of books available in each library. As such, we calculated our score by just taking the number of yet-unscanned books in each library. This ran for about 15 minutes, and with only a few minutes remaining, we managed to improve the score for problem D by 20,000 points, giving us a total of roughly 26.8 million and putting us at rank 16/204 in Belgium and 531/10724 globally.
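Under those properties the greedy score collapses to a simple count; a sketch, assuming books are again kept as sets of IDs:

```python
def library_score_d(lib, assigned):
    """Problem D: all book scores, signup times and throughputs are identical,
    so a library's value is just its number of still-unscanned books."""
    return len(lib["books"] - assigned)

# Books 2 and 4 have already been scanned elsewhere, leaving two new ones.
print(library_score_d({"books": {1, 2, 3, 4}}, assigned={2, 4}))  # 2
```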

Our certificate for the qualification rounds.

Lessons learned & possible improvements

  • We inspected our input problems way too late. Each of the problems had unique properties, and your greedy approach should be tailored to those.
  • In order to write tailored approaches for each subproblem, you first need to have a strong basic version in place quickly.
  • Different tweaks could be made to improve the score even more. The top-ranked team achieved 27,203,691, which is only about 1.5% higher than our score. Possible improvements include (i) taking into account the frequencies of the books; (ii) adapting the formulas slightly; (iii) local neighborhood search (making small adaptations to the solution to check whether they improve it); and so on.

We definitely didn’t do badly at all, and it was a fun challenge! If you were part of one of the 530 teams ranked above us, or if you have ideas for possible improvements, definitely let us know!

Addendum: optimally solving the practice problem

This year, a simple toy problem was provided as well to serve as a warm-up for the real competition. The problem statement was simple: you get a collection of pizzas, each with a different number of slices, and a number M. The goal is to order pizzas such that the total number of slices is as close to M as possible without exceeding it:

If M were equal to 17, then ordering pizzas S0, S2 and S3 would be the optimal solution, as that gives us a total of 16 slices.

It’s easy to see that this is just a rephrasing of the subset-sum problem (a special case of the knapsack problem), which is a walk in the park for powerful solvers such as Gurobi:

We basically create an array asm of K binary variables, with K the number of pizzas. If asm[i] = 1, that means we are ordering pizza i. The objective is to make the sum of asm[i]*score[i] as high as possible (equivalently, to make its difference with M as low as possible), with the additional constraint that it cannot exceed M. The script reads all files in the directory and writes away the output files.
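The Gurobi script itself is not embedded here. As a solver-free illustration of the same model, a classic subset-sum dynamic program also finds the optimum for instances of this size (the slice counts below are made up for illustration):

```python
def max_slices(slices, M):
    """Largest achievable sum of a subset of `slices` that does not exceed M."""
    reachable = {0}  # subset sums we can build so far
    for s in slices:
        # Extend every known sum by this pizza, keeping only sums <= M.
        reachable |= {r + s for r in reachable if r + s <= M}
    return max(reachable)

# A toy instance in the spirit of the figure: four pizzas, M = 17.
print(max_slices([2, 5, 6, 8], 17))  # 2 + 6 + 8 = 16
```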
