How to LB probe on Kaggle

Published in

Towards Data Science

8 min readJun 5, 2019

In LANL Earthquake Prediction competition on Kaggle our team finished first place on the public leader-board with 4500 competing teams by using LB probing strategy. This is an overview describing that strategy.

We start by giving a short intro into the competition setting. Then we proceed with our discoveries about the test data, and finally describe the main approach that we employed, — mixed integer programming to iterate through feasible solutions with gradual score improvement.

The competition train data is comprised of 629 million continuous measurements, which span 163 seconds of a laboratory experiment and includes 16 earthquake cycles. The times between the earthquakes in the train data are ranging from 7 to 16 seconds. Based on this training signal the competitors are asked to predict time remaining to the next failure on 2624 test segments, each segment containing 150k continuous measurements or about 0.038 seconds of the test data. The goal is to minimize a mean absolute error of the predictions.

About one month before the competition end it was reported on the forum that the data was probably coming from a specific laboratory experiment, and an academic article describing the experiment contained a picture of the train set which matched exactly with the competition train set.

The picture also contained the suspected test set, and it was safe to assume at that point that the test data segments had been cut from a continuous interval 108 seconds long which followed the train set.

Test data segments were further split by the organizers into public and private data set, with public data known to contain 13% of the segments. However it was not disclosed how many exactly segments in the public part, nor which one is private and which one is public.

Each team is given two submissions a day, where a participant is asked to give his predictions for all 2624 test segments. The system answers with public set MAE score which it calculates internally only on the public segments subset, and the best score of each team appears on the public leader-board.

A submission with all 10 gave me a score of 5.982 and all 20 score 15.982, which meant that there are no values in public leader-board higher than 10.

Then we made another submission by setting all values to 10 but additionally changing 60 random segments values to 10000. The score was 293.051, and from that it was easy to conclude that the only integer number of segments in the public set that was possible is 348, and we happened to hit 10 public segments in the selected 60 segments.

Next, we submitted the same value for all predictions for the following list of values: [0, 2, 3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 5, 9, 9.72, 10]. Based on the resulting scores and assuming that density of segments is constant for each of the intervals, one can calculate the density of the segments on each of the intervals. In fact, the problem is even over-constrained and some optimization technique or heuristic is required to solve it.

Note that the train set last earthquake is truncated, stopping at 9.76. Therefore the test set starts at 9.72 at the earliest (subtracting one segment duration 0.038). This is where the value 9.72 came from in the list above, it was just a guess that this value can be relevant.

fig. 1: density of public segments targets

Without going into details, we formulated the problem as a quadratic programming convex optimization problem and solved with cvxopt package. It resulted in segments density depicted on the image (see fig. 1).

fig. 2: density of public segments targets only with scores 0, 4.5, 9.72 and 10

The obtained result is noisy, because the assumption that the density of segments is constant for each of the intervals does not hold precisely, after all we only have about 348 discrete values, it is neither uniform nor continuous. But still one can see clearly that we have two regimes. By leaving only 4.5 and 9.72 values in the model we get much cleaner result, see the image (fig. 2). The calculated densities are 0.141, 0.070 and 0.001. If you are interested in making the calculations yourself and verify this result the scores are the following:

4.017 for all 0
2.379 for all 4.5
5.702 for all 9.72
5.982 for all 10

The above calculations proved to us decisively that the public set consists of two continuous intervals: 9.72 to zero, which is apparently the continuation of the last train earthquake, and another interval from 4.5 to zero. I bet this last 4.5 seconds is taken from the last test data earthquake, making the private test data continuous.

By knowing the total test duration from the article we also estimated that there are gaps of 7500 samples on average between consecutive test segments. Taking this gap into consideration one can calculate that into two intervals 0 to 9.72 and 0 to 4.5 can fit exactly 348 intervals, further confirming the conclusions.

Next we made 88 submissions to discover what segments were public and what private (yes, it took 44 days!).

In each such submission we set all values to 10, except for 30 segments, for which we selected values

val[i] = 10 + 0.348*2^i

The resulting submission score subtracting 5.982 (the score of all 10 alone) in binary representation gave us the public/private mask for those 30 segments.

Now when we know the public segments indices, one way to discover the public segment values is to submit all values 10 and change one public segment at a time to zero. This procedure would take 348 submissions to cover the whole set, which was infeasible.

The approach that we adopted is based on mixed integer programming (MIP), and this the magic that allowed us to improve our public LB score at much faster pace. The problem formulation is that we need to match 348 segments to 348 values, where the values are equally spaced from 0 to 9.72 and from 0 to 4.5, as discussed above. Each submission that we did serves us as a constraint in this formulation. Full MIP problem formulation is presented below, we used pulp python package and cplex as a solver (academic license). The syntax in the code is that of pulp.

# decision variables
# matrix of indicators if a segment selects a value
pp=LpVariable.dicts("mix",(range(D),range(D)),0,1,LpInteger)
# absolute value of errors for submission-segment pairs
tt=LpVariable.dicts("err",(range(S),range(D)),0,20,cat='Continuous')
# margins for submission scores
dd=LpVariable.dicts("div",(range(S)),0,100,cat='Continuous')# D is number of segments = 348
# S is number of submissions available
# den is a probability density of a segment to have a value
# sub is predictions for public segments in each submission
# times is a list of all 348 public segments values
# sc is the submissions scores# objective
prob += lpSum(lpSum(pp[d][i] * den[d,i] for i in range(D)) for d in range(D)) + 1000*lpSum(dd[s] for s in range(S))

# constraints
for d in range(D):
   # one selected for each segment
   prob += lpSum(pp[d][i] for i in range(D)) == 1

for i in range(D):
   # one selected for each value
   prob += lpSum(pp[d][i] for d in range(D)) == 1

for s in range(S):
   for d in range(D):
      # error of each submission-segment
      prob += tt[s][d] == lpSum(pp[d][i] * abs(sub[s,d] - times[i]) for i in range(D))

for s in range(S):
   # sum of errors is +-0.0005 rounding error plus margin
   prob += lpSum(tt[s][d] for d in range(D))<=D*(sc[s]+0.0005+dd[s])
   prob += lpSum(tt[s][d] for d in range(D))>=D*(sc[s]-0.0005-dd[s])

The objective function that we minimize consists of two parts. First, we want to find feasible solution, therefore we want to bring margins to zero (decision variable dd). Second, between all feasible solutions we want to select the maximum likelihood one. For this we used the output from a separate LGBM classification model which was trained regularly on the train set.

The process for MIP iterations was the following. For a given set of available submissions we run the model, obtained feasible maximum-likelihood solution and this is the result we submitted to Kaggle. The set of feasible solutions was still quite big, we usually improved our score by 0.000–0.080 each time. Then we added the new submission to the set and repeated the process with the new constraint.

Note that this MIP problem is still quite big, reaching 245k variables and 124k constraints at the end. We run the model for several hours per iteration and even that was not reaching the exact feasible solutions. Average margin across the submissions was 0.003 at the end (actually impressively good).

At the end we had 353 submissions with scores as constraints, which included all our own relevant submissions + 261 submissions that we downloaded from the public kernels.

As a summary, we developed a MIP model formulation which was looking for feasible matching between 348 public segments and the values that we know to comprise the public test set. At each iteration the model generated a feasible solution which satisfies all the constraints, — submissions with scores that we already had. It is a feasible solution in a sense that the model thinks a score 0 is possible if this solution is submitted. After actually submitting the generated values we got a real score instead of 0, and run the model again so it could correct itself given the new information.

The property of the model is that it generates a matching which satisfies all the submissions. Namely, if you calculate MAE metric between the generated solution and any submission public entries you get the actual submission score (with average error of 0.003 as mentioned above).

The fascinating part is that we added so many public kernels submissions that we completely covered that specific subspace. We stopped fetching more public kernels submissions when we observed that our model solution predicts 95% of the scores with maximum errors of 0.02 without learning on them specifically! And on the opposite side, the submissions generated by the model had predicted value of zero, been orthogonal to all the others, and providing a lot of new information.

During the last competition day the generated by MIP solution was predicting scores on the subspace of “regular” submissions so well that we used it to estimate public LB scores of all our final submissions candidates (the participants were asked two select two submissions as final submissions).

Unfortunately we arrived at been ready to run MIP model quite late in the competition and had time to submit only 17 such submissions. Their scores are shown below.

1.641 MIP 1th
1.511 MIP 2th
1.681 MIP 3th - Me giving only 20% to make any significant progress
1.304 MIP 4th - Better score than our best model
1.409 MIP 5th
1.326 MIP 6th
1.357 MIP 7th
1.306 MIP 8th
1.412 MIP 9th
1.231 MIP 10th - First place reached
1.226 MIP 11th
1.225 MIP 12th
1.145 MIP 13th
1.110 MIP 14th
1.101 MIP 15th
1.080 MIP 16th
1.104 MIP 17th

Warm thanks to my Kaggle teammate Ido Greenberg for helping with the process and supporting the approach even when I was estimating it to have only 20% chance of producing anything of value. It was a risky strategy that actually worked.

How to LB probe on Kaggle

Written by Zahar Chikishev