
It was mid-February 2020 when my friend Mario told me about the most important national datathon for bachelor's and master's degree students. Back then I was a sophomore and he was a junior, both studying Computer Science at the Polytechnic University of Valencia.
What is more, we knew nothing about Data Science or AI.
So we signed up for the contest and started learning Python, studying basic machine learning theory, and practicing a lot with online tutorials.
The challenge we chose was the Minsait Land Classification. The goal was to build a multi-class classifier to determine the type of given land areas. The metric used to evaluate the models was accuracy: the percentage of areas whose class was correctly predicted.

The training data consisted of 55 variables plus the target:
- Identifier: A unique ID for each area.
- Coordinates: Two variables, X and Y, describing the location of each area in the plane.
- Color variables: 11 variables for each of the Red, Green, Blue, and Infrared channels (44 in total), describing the deciles of each channel.
- Area: Extension in m².
- Geometric variables: 4 variables that give geometric information, without any specification.
- Year of construction of the plot.
- Maximum height built so far in the given plot.
- Land register identifier: represents the quality of the area. It takes values in [A, B, C, 1, 2, 3, 4, 5, 6, 7, 8, 9], with A being the worst quality and 9 the best.
- Class: The target to predict. There were 7 different classes: Residential, Industrial, Public, Office, Other, Retail and Agriculture.
The testing data consisted of only the features.
To submit our result, we had to run our model on the testing data and send the predictions in the form AREA_ID | PREDICTED_CLASS.
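As a rough illustration (the file name, column labels, and dummy values below are illustrative, not the official ones), writing that file with pandas could look like this:

```python
import pandas as pd

# Dummy values for illustration; in practice these came from our model.
test_ids = [101, 102, 103]
predicted_classes = ["RESIDENTIAL", "INDUSTRIAL", "AGRICULTURE"]

# One row per test area in the form AREA_ID|PREDICTED_CLASS.
submission = pd.DataFrame({"id": test_ids, "class": predicted_classes})
submission.to_csv("submission.txt", sep="|", index=False, header=False)
```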
If two teams' accuracies differed by less than 5%, it would be considered a tie. In that case, the judges would take into account factors like code documentation and the presentation to break it.
Only the top 3 teams would go to Madrid (an event later cancelled due to the COVID-19 crisis) and earn:
- 1st place, 6000€ per team.
- 2nd place, 2400€ per team.
- 3rd place, 1200€ per team.
The first problems appeared soon.
The maximum built height and land register quality features had some null values, which we set to -1. Then we checked the class distribution on the training set.

It was hugely imbalanced! The residential class represented 87.28% of the samples, while agriculture represented only 0.33%.
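In pandas, those first checks boil down to a few lines. A minimal sketch, assuming hypothetical file and column names:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical file name

# Replace the null values of the two incomplete features with -1.
train[["MAX_HEIGHT", "QUALITY"]] = train[["MAX_HEIGHT", "QUALITY"]].fillna(-1)

# Relative frequency of each class: this is where the 87.28% (Residential)
# versus 0.33% (Agriculture) imbalance showed up.
print(train["CLASS"].value_counts(normalize=True))
```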
Furthermore, comparing the means of each feature between the train and test datasets, we found that:
- The mean area of residential plots in the training set was 281 m², while for every other class it was above 1000 m²; the overall training mean was 441 m².
- The overall mean in the testing set was 967 m².
Since the test mean sat much closer to the non-residential figures, we could tell that the class distribution differed between the two datasets: the test set had to contain proportionally fewer residential areas.
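The comparison itself is equally short (same hypothetical names as before):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical file names
test = pd.read_csv("test.csv")

# Mean area per class in the training set: ~281 m² for Residential,
# above 1000 m² for every other class.
print(train.groupby("CLASS")["AREA"].mean())

# Overall means: ~441 m² in train versus ~967 m² in test.
print(train["AREA"].mean(), test["AREA"].mean())
```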
Those were the two main drawbacks we had found: class imbalance, and a distribution shift between train and test.
We spent the first month reading about the basic concepts of Machine Learning and the different types of learning, and, for the supervised kind specifically, linear regression, binary classification, multi-class classification… Additionally, we had to learn Python, which luckily was easy.
Once we thought we understood our mission a little, and the options we had, we started working on the problem.
After trying several classification models, we found that combining an XGBoost classifier with a random forest gave us the best results. To deal with randomness, we trained them several times and assigned each class by consensus, taking the mode of the predictions. We also had to deal with the imbalance problem, so we decided to train on only 8000 randomly picked samples of the residential class. Some hyperparameter tuning with grid search, and we were all set to send our final submission in a Jupyter notebook.
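Condensed, that pipeline looked something like the sketch below, assuming scikit-learn's RandomForestClassifier and xgboost's XGBClassifier, hypothetical file and column names, and arbitrary seeds (the grid search step is omitted, and the features are assumed to be numeric by this point):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")  # hypothetical file names
test = pd.read_csv("test.csv")

# Keep only 8000 randomly picked residential samples to fight the imbalance.
residential = train[train["CLASS"] == "RESIDENTIAL"].sample(8000, random_state=0)
balanced = pd.concat([residential, train[train["CLASS"] != "RESIDENTIAL"]])

X = balanced.drop(columns=["ID", "CLASS"])
X_test = test.drop(columns=["ID"])
encoder = LabelEncoder()
y = encoder.fit_transform(balanced["CLASS"])

# Train both model families several times with different seeds...
all_preds = []
for seed in range(5):
    for model in (XGBClassifier(random_state=seed),
                  RandomForestClassifier(random_state=seed)):
        model.fit(X, y)
        all_preds.append(model.predict(X_test))

# ...and assign each test area the mode (majority vote) of its predictions.
consensus = pd.DataFrame(all_preds).mode(axis=0).iloc[0].astype(int)
final_classes = encoder.inverse_transform(consensus)
```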
The final results arrived days later: we had achieved 69.999% accuracy, the best national result at that moment being 70.61%. We were close.
We became the university winners and national finalists! Team Astralaria was still alive in the competition!
It was time to hustle. We then realized that our opponents were much older and more experienced. In fact, the other teams were formed by master's degree students in Data Science, and some of them were even already working in the field.
Hence, if we wanted a good final position, we had to start looking for interesting insights in the data and trying new models.
Visualization tools might help us find patterns to differentiate the classes more easily, we thought. So we studied PCA and t-SNE for dimensionality reduction, as well as violin plots for each feature.


The results were revealing: all the classes overlapped a lot, and we could not find any pattern or grouping.
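This kind of exploration is compact with scikit-learn and matplotlib. A sketch (not our exact plotting code, and the column names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # hypothetical file name
X = StandardScaler().fit_transform(
    train.drop(columns=["ID", "CLASS"]).select_dtypes("number"))

# Project the features down to 2 dimensions with PCA and with t-SNE,
# then color each point by its class to look for separable groups.
for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2))]:
    embedding = reducer.fit_transform(X)
    for cls in train["CLASS"].unique():
        mask = (train["CLASS"] == cls).to_numpy()
        plt.scatter(embedding[mask, 0], embedding[mask, 1], s=2, label=cls)
    plt.title(name)
    plt.legend()
    plt.show()
```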
After this, we tried some minor changes to the data, which seemed to improve our results a bit (sketched in code after the list):
- Converting the land register quality feature into 12 one-hot variables.
- Changing the construction year into the building's age in years.
- Changing the null values to 0 instead of -1.
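In pandas, those three changes look roughly like this (hypothetical file and column names again):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical file name

# Null values to 0 instead of -1.
train = train.fillna(0)

# Land register quality: one categorical column -> 12 one-hot columns.
train = pd.get_dummies(train, columns=["QUALITY"])

# Construction year -> age of the building in years.
train["AGE"] = 2020 - train["YEAR_BUILT"]
train = train.drop(columns=["YEAR_BUILT"])
```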
We still had many things to try, and less than one month to run experiments on our modest laptops.
This is how we split the work.
Mario would try to reduce dimensionality using autoencoders, build some fancy model with neural networks, and test how Extra Trees and Support Vector Machines could help.
I would work on over- and under-sampling techniques, the One-Against-All with Data Balancing (OAA-DB) algorithm, and feature engineering.
After around two weeks of researching and experimenting, we thought we had found something.
My colleague tested and confirmed that the best ML models we could use were the two we had been using so far. For my part, I had two intuitions:
- The neighbors. If an area is residential, wouldn't the majority of its neighbors also be residential? Following this idea, we built a K-D Tree over the coordinates to find the 4 nearest training areas of each area, from which we derived the probability of having a neighbor of each class. This procedure increased the dimensionality by 7 features, one per class (see the code sketch below).
- OAA-DB algorithm. Mentioned above. At first it impressed us because it improved our results by about 10%, but we finally discarded it because it led to huge overfitting.
We also experimented with Deep Feature Synthesis and some other tools, but only the neighbors technique seemed to work well: it improved the accuracy by around 1.5%.
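A sketch of the neighbors idea with SciPy's cKDTree, under the same hypothetical naming:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

train = pd.read_csv("train.csv")  # hypothetical file names
test = pd.read_csv("test.csv")

classes = sorted(train["CLASS"].unique())
labels = train["CLASS"].to_numpy()
tree = cKDTree(train[["X", "Y"]].to_numpy())

def neighbor_class_fractions(points, k=4):
    """For each point, the fraction of its k nearest training areas
    that belong to each class: 7 extra features, one per class."""
    _, idx = tree.query(points, k=k)
    neighbor_labels = labels[idx]  # shape (n_points, k)
    return np.stack([(neighbor_labels == c).mean(axis=1) for c in classes],
                    axis=1)

# For training areas one would query k + 1 neighbors and drop the first
# (the point itself); for test areas the plain query is enough.
test_extra = neighbor_class_fractions(test[["X", "Y"]].to_numpy())
```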
We had only one thing left to determine: the undersampling technique to reduce the number of residential samples in the training set. We experimented with all the methods in the imblearn package; however, we finally settled on random undersampling, reducing the class to 6000 samples. Almost the same as we had started with.
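With imblearn, that final choice reduces to something like this (illustrative names, not our exact code):

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

train = pd.read_csv("train.csv")  # hypothetical file and column names
X = train.drop(columns=["CLASS"])
y = train["CLASS"]

# Keep only 6000 randomly chosen residential samples;
# every other class is left untouched.
sampler = RandomUnderSampler(sampling_strategy={"RESIDENTIAL": 6000},
                             random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
```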
Submission day. I won't forget it. The deadline was 13:00. Almost everything was ready. I woke up really early to finish the last details of the documentation. Mario would join me later.
We still had not generated the results.
10:30. I had finished documenting the code and passed the notebook to Mario.
10:40. There was a bug in the notebook's underlying JSON file which we didn't understand. My colleague tried to open it in Visual Studio Code, Google Colab, and Jupyter, but nothing worked.
11:00. New notebook. Still the same issue.
11:35. Python script generated, with both parts, and documented. Mario could not execute it.
12:15. Mario was finally able to run the script to train the model and get the results. However, we didn't know if it would finish in time. And we couldn't stop it.
It was time to believe in luck.
12:53. The program finished!
12:58. Results file sent. We made it.
Two weeks later, the best three teams were revealed. We had not made it to the top 3, so we would not win any prize.
At first, we felt really sad. Nevertheless, we had already come really far, especially taking into account that we had started knowing nothing.
Additionally, the organization told us that they would later communicate our final position, as well as the accuracy of our last submission and some tips to improve.
Some weeks after that, it arrived.
We had placed 4th overall! Moreover, as the other teams were formed by master's students, we could consider ourselves 1st place at the bachelor's level!
We felt really happy. Even though we did not get any cash prize, we had learned a lot, had fun, and discovered the amazing fields of Data Science and AI!
Personally, I have to thank my teammate and the organizers of the competition for making everything possible.
To the reader of this article: don't worry if you don't know anything about a certain field yet. Put in the hard work it deserves and learn, even if you do it by participating in a contest.
Check out our final work here.
Thanks for reading! 😀