
Predict Who Survived the Titanic Disaster

Your First Kaggle Competition Submission- The Easy Way

Photo by Joseph Barrientos on Unsplash

Kaggle, owned by Google, is an online community for data science and machine learning practitioners. In other words, it is your home for data science, where you can find datasets and compete in competitions. However, I struggled to complete my first competition submission because of what I would call improper resources. I went through the kernels (read: ‘articles’) for this competition, but none of them were designed for beginners. As a beginner, I don’t want to see visualizations that I can’t perform or interpret; I just need to understand what’s happening in simple words.

But I finally made it. And if you are here because you too are struggling to get started with Kaggle, then my friend, this article is going to make your day. No unnecessary lines of code or visualizations – just a straight path to your first submission.

The brick walls are there for a reason. Not there to keep us out; but to give us a chance to show how badly we want something. The brick walls are there to stop the people who don’t want it badly enough. They are there to stop the other people.

  – Randy Pausch

Step 0 – First Things First

To go along with this getting-started-with-Kaggle tutorial, you need to do two things. First, head over to this link and get yourself a Kaggle account. Then, join the Kaggle Titanic competition through this link. Done? Great. We are all set. Let’s do some real work now.

Step 1 – Understand your Data

Once you sign up for the competition, you can find the data on the homepage of the competition. To load and perform very basic manipulation of data, I am using Pandas, a data manipulation library in python. If you’re not aware of it, I suggest you go to this 10-minute guide to get yourself familiar with it.

In machine learning, the data is mainly divided into two parts – training and testing (the third split is validation, but you don’t have to care about that right now). Training data is for training our algorithm, and testing data is for checking how well it performs. The split ratio between train and test data is usually around 70–30. Here we have a total of 891 entries for training and 418 entries for testing. Loading up the data with Pandas will give you the 12 columns shown below. We will call them features – nothing new, just a fancy name. I encourage you to go through the data at least once before moving forward.

PassengerId : int     : Id
Survived    : int     : Survival (0=No; 1=Yes)
Pclass      : int     : Passenger Class
Name        : object  : Name
Sex         : object  : Sex
Age         : float   : Age
SibSp       : int     : Number of Siblings/Spouses Aboard
Parch       : int     : Number of Parents/Children Aboard
Ticket      : object  : Ticket Number
Fare        : float   : Passenger Fare
Cabin       : object  : Cabin
Embarked    : object  : Port of Embarkation
                        (C=Cherbourg; Q=Queenstown; S=Southampton)
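As a minimal sketch of this loading step: in the real notebook you would read `train.csv` and `test.csv` from the competition page; the tiny hand-built DataFrame below stands in for the real file (its three rows mirror the first rows of the actual dataset) so you can see the columns and dtypes.

```python
import pandas as pd

# Real usage (files downloaded from the competition page):
#   train = pd.read_csv('train.csv')
#   test  = pd.read_csv('test.csv')
# Illustrative stand-in with the same 12 columns:
train = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived':    [0, 1, 1],
    'Pclass':      [3, 1, 3],
    'Name':        ['Braund, Mr. Owen Harris',
                    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
                    'Heikkinen, Miss. Laina'],
    'Sex':         ['male', 'female', 'female'],
    'Age':         [22.0, 38.0, 26.0],
    'SibSp':       [1, 1, 0],
    'Parch':       [0, 0, 0],
    'Ticket':      ['A/5 21171', 'PC 17599', 'STON/O2. 3101282'],
    'Fare':        [7.25, 71.2833, 7.925],
    'Cabin':       [None, 'C85', None],
    'Embarked':    ['S', 'C', 'S'],
})

print(train.shape)   # (number of rows, 12 columns)
print(train.dtypes)  # data type of each feature
```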

Also, an understanding of the data type of each feature is important. Now that we’ve loaded our data and understood what it looks like, we will move on to feature engineering – in other words, measuring the impact of each feature on our output: whether a passenger survived or not.

Step 2 – Feature Engineering

As we discussed, feature engineering is about measuring the impact of each feature on the output. But more importantly, it is not just about using existing features; it is about creating new ones that can significantly improve our output. Andrew Ng said, "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." We will go through each feature we are using so that you can understand how to use existing features and how to create new ones.

2.1 – Passenger Class

It is obvious that the class of a passenger is directly related to the survival rate: the more important a person was, the earlier they got out of the disaster. And our data tells the same story – 63% of Class 1 passengers survived. This feature is definitely impactful. The Pclass column has no missing values, so no manipulation is needed.
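A quick sketch of how such a per-class survival rate is computed with pandas (the small DataFrame here is illustrative; the real numbers come from `train.csv`): grouping by Pclass and taking the mean of Survived gives the fraction who survived in each class.

```python
import pandas as pd

# Illustrative stand-in for the train set
train = pd.DataFrame({'Pclass':   [1, 1, 2, 3, 3, 3],
                      'Survived': [1, 1, 1, 0, 0, 1]})

# Mean of the 0/1 Survived column per class = survival rate per class
rate = train[['Pclass', 'Survived']].groupby('Pclass', as_index=False).mean()
print(rate)
```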

2.2 – Sex

Sex is again important and strongly related to the survival rate – women and children were saved first during the tragedy. We can see that 74% of all females were saved, while only 18% of all males were. Again, this will impact our outcome.
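The same groupby pattern yields the survival rate by sex; again, the toy data below is only a stand-in for the real train set.

```python
import pandas as pd

# Illustrative stand-in for the train set
train = pd.DataFrame({'Sex':      ['male', 'female', 'female', 'male'],
                      'Survived': [0, 1, 1, 0]})

rate = train[['Sex', 'Survived']].groupby('Sex', as_index=False).mean()
print(rate)
```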

Feature 2 Output

2.3 – Family Size

The next two columns are SibSp and Parch, which are not directly related to whether a person survived. That is where the idea of creating a new feature comes in. For each passenger, we determine their family size as SibSp + Parch + 1 (the passenger themselves). Family size ranges from a minimum of 1 to a maximum of 11, with a family size of 4 having the highest survival rate of 72%.

It seems to have a good effect on our prediction, but let’s go further and categorize passengers by whether they were alone on the ship or not. Looking at that too, it seems to have a considerable impact on our output.
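The two derived features described above can be sketched in a couple of lines (toy data again stands in for the real columns): FamilySize sums the relatives plus the passenger, and IsAlone flags a family size of exactly 1.

```python
import pandas as pd

# Illustrative stand-in for the SibSp/Parch columns
train = pd.DataFrame({'SibSp': [1, 0, 3],
                      'Parch': [0, 0, 2]})

# Family size = siblings/spouses + parents/children + the passenger themselves
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
# 1 if the passenger travelled alone, 0 otherwise
train['IsAlone'] = (train['FamilySize'] == 1).astype(int)
print(train)
```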

2.4 – Embarked

The place from which a passenger embarked may have something to do with survival (though not always), so let’s take a look. This column has a few missing values; to deal with them, we replace the NAs with ‘S’, the most frequent value.
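Filling the missing ports is a one-liner with `fillna`; the small frame below is just for illustration.

```python
import pandas as pd

# Illustrative stand-in for the Embarked column (None marks a missing value)
train = pd.DataFrame({'Embarked': ['S', 'C', None, 'Q', 'S']})

# 'S' (Southampton) is the most frequent port, so use it for the gaps
train['Embarked'] = train['Embarked'].fillna('S')
print(train)
```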

Feature 4 Output

2.5 – Fare

There is missing data in this column as well, but we cannot deal with every feature in the same way. To fix the issue here, we fill the gaps with the median value of the entire column. Then we bin the fares with qcut: the bins are chosen so that each bin contains roughly the same number of records (equal parts). Looking at the output, the impact is considerable.
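A minimal sketch of the median fill plus quantile binning (the fares below are made up; four bins are assumed for illustration):

```python
import pandas as pd

# Illustrative stand-in for the Fare column (None marks a missing value)
train = pd.DataFrame({'Fare': [7.25, 71.28, None, 8.05, 53.1, 12.35, 30.0, 7.9]})

# Fill the gap with the column median
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
# qcut picks bin edges so each bin holds roughly the same number of passengers
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
print(train)
```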

Feature 5 Output

2.6 – Age

Age has some missing values too. We will fill them with random numbers between (average age minus standard deviation) and (average age plus standard deviation). After that, we will group the ages into 5 bins. This feature has a good impact as well.
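The random fill and binning might be sketched as follows (the ages are made up, and the fixed random seed is only there to keep the example reproducible):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility

# Illustrative stand-in for the Age column
train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 26.0, np.nan, 35.0]})

avg, std = train['Age'].mean(), train['Age'].std()
n_null = train['Age'].isnull().sum()

# Random ages drawn from [avg - std, avg + std] for the missing entries
fill = rng.integers(int(avg - std), int(avg + std) + 1, size=n_null)
train.loc[train['Age'].isnull(), 'Age'] = fill

# Group ages into 5 equal-width bins
train['CategoricalAge'] = pd.cut(train['Age'], 5)
print(train)
```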

Feature 6 Output

2.7 – Name

This one is a little tricky. From each name, we have to retrieve the associated title, e.g. Mr or Captain. To do that, we use Python’s regular expression library (see the regular expression how-to). First, we extract the titles from the names and store them in a new column called Title. Then we clean the list by narrowing it down to a few common titles.
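A sketch of the title extraction: the regex grabs the word that precedes a dot, and the replacement lists below (which titles count as "Rare", and which aliases map to Miss/Mrs) are one reasonable choice, not the only one.

```python
import re
import pandas as pd

# Illustrative stand-in for the Name column
train = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                               'Cumings, Mrs. John Bradley',
                               'Heikkinen, Miss. Laina',
                               'Crosby, Capt. Edward Gifford']})

def get_title(name):
    # The title is the word followed by a dot, e.g. " Mr." -> "Mr"
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ''

train['Title'] = train['Name'].apply(get_title)

# Narrow rare titles down to a single bucket, and unify a few aliases
train['Title'] = train['Title'].replace(
    ['Capt', 'Col', 'Dr', 'Major', 'Rev', 'Countess', 'Lady', 'Sir',
     'Don', 'Dona', 'Jonkheer'], 'Rare')
train['Title'] = train['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
print(train['Title'])
```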

That’s it. We have cleaned our features and they are now ready to use. However, there is one more step before we feed our data to the ML algorithm. The thing about ML algorithms is that they only take numerical values, not strings. So we have to map our data to numerical values and convert the columns to the integer data type.

Step 3 – Mapping Data

Mapping data is easy – by looking through the code you’ll get the idea of how it works. Once that’s done, we have to select which features to use. Feature selection is as important as feature creation. We drop the unnecessary columns so that they don’t affect our final outcome.
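The mapping and dropping steps might look like this on a toy frame (the specific integer codes are one arbitrary but common choice; any consistent mapping works):

```python
import pandas as pd

# Illustrative stand-in for the cleaned train set
train = pd.DataFrame({'Sex':      ['male', 'female'],
                      'Embarked': ['S', 'C'],
                      'Title':    ['Mr', 'Mrs'],
                      'Name':     ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John'],
                      'Ticket':   ['A/5 21171', 'PC 17599'],
                      'Survived': [0, 1]})

# Map strings to integer codes so the ML algorithm can consume them
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
train['Title'] = train['Title'].map(
    {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}).astype(int)

# Drop columns we are not feeding to the model
train = train.drop(['Name', 'Ticket'], axis=1)
print(train)
```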

Final data that we will feed to ML algorithm

That is it. You have completed the hard part. Look at your data – it looks so beautiful. Now we only have to predict our outcome, which is the easy part. Or at least I’ll make it easy for you to understand.

Jack, come on buddy, we're almost there

Step 4 – Prediction

As we discussed, we require training and testing data. "Yeah, Dhrumil, we have it – what now?" Ok, perfect. Now we need to train our model. To do that, we need to provide the data in two parts – X and Y.

X : X_train : Contains all the features
Y : Y_train : Contains the actual output (Survived)

To elaborate: we need to tell our model what output we are looking for, so it can train accordingly. For instance, your friend is out shopping and you want goggles; you send a photo of goggles saying you want the same. That’s training. You are training them to bring similar goggles by describing the features (Aviators, Wayfarers) and providing the exact output (the picture of the goggles).

With the data separated, we call our classifier, fit the training data with scikit-learn’s .fit method, and predict the output on the testing data with the .predict method.
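The whole fit-then-predict cycle is only a few lines; the tiny two-feature frames below stand in for the engineered Titanic data, and the fixed random_state is just for reproducibility.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in for the engineered training data
X_train = pd.DataFrame({'Pclass': [1, 3, 1, 3],
                        'Sex':    [0, 1, 0, 1]})
Y_train = pd.Series([1, 0, 1, 0], name='Survived')
X_test = pd.DataFrame({'Pclass': [1, 3],
                       'Sex':    [0, 1]})

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, Y_train)       # training: features plus the known outcome
Y_pred = clf.predict(X_test)    # prediction on unseen test rows
print(Y_pred)
```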

Note – As this tutorial is for beginners, I am not including other classifiers, but the process remains the same: call the classifier, fit the data, predict. There are several other classifiers you can explore, but I used a Decision Tree because, to my knowledge, it works best with this dataset. To learn more about decision trees, refer to this article.

Yes, Tony, that's great for the first time

Step 5 – Your First Submission

And finally, submitting our output. Our output .csv file should have only two columns – PassengerId and Survived – as mentioned on the competition page. After creating it and submitting it on the competition page, my submission scored 0.79425, which was in the top 25% at the time of writing this article.
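Building the submission file is a short pandas call; the ids and predictions below are hypothetical placeholders (in the real notebook they come from `test.csv` and `clf.predict`), and the filename `submission.csv` is just a choice.

```python
import pandas as pd

# Hypothetical values; replace with test['PassengerId'] and clf.predict(X_test)
passenger_ids = [892, 893, 894]
y_pred = [0, 1, 0]

submission = pd.DataFrame({'PassengerId': passenger_ids,
                           'Survived': y_pred})
# index=False keeps the file to exactly the two required columns
submission.to_csv('submission.csv', index=False)
print(submission)
```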

The position I got on the leaderboard, where do you sit?

I encourage you to explore different features to improve your model accuracy and your rank in this competition as well. I’d love to hear from you that you’ve made it to the Top 5% or even better, Top 1%. You will find the entire code on my GitHub repository.

Endnotes

I hope this article has answered your primary question, "How do I start with Kaggle?" Adequate knowledge, good resources, and the willingness to learn new things are all you need to move ahead. You don’t have to be a master from the beginning; it all comes with persistence. If you are reading this, you have all the energy to fulfill your goals – just don’t stop, no matter what. If you have doubts about this article, reach me through email, Twitter, or even LinkedIn. And even if you don’t have any doubts, I’d still love to see you in my inbox with your valuable feedback or suggestions.

Happy Learning.

