In the world of Data Science, using Kaggle is almost a necessity. You use it to get datasets for your projects, to view and learn from notebooks generously shared by people who want to see you succeed at building good machine learning models, to discover new insights into how to approach complex machine learning problems, and the list goes on.
One of the best ways to try out your skills in the real world is through the competitions hosted on the website. The Titanic competition is the most wholesome, beginner-friendly way to get started, and it gives you a good feel for what to expect and how best to approach a problem.
In this article, I will walk you through making your first Machine Learning model and successfully entering your own ship to sail in the sea of these competitions.
Let’s go!
Understanding the data
First, open up the competition page. You'll need to refer to it every now and then.
For this particular task, our problem statement and the end goal are clearly defined on the competition page itself:
We need to develop an ML algorithm to predict the survival outcome of passengers on the Titanic.
The outcome is measured as 0 (not survived) and 1 (survived). This is the dead giveaway that we have a binary classification problem at hand.
Well, fire up your Jupyter notebook, and let's see what the data looks like!
import pandas as pd

df = pd.read_csv('data/train.csv')
df.head()

At first glance, there is a mix of categorical and continuous features. Let's take a look at the data types of the columns:
df.info()

I won’t go into the details about what each of these features actually represent about a Titanic passenger – I’ll assume you’ve read about it by now on the Kaggle website (which you should if you haven’t).
Data cleaning
This is the backbone of our entire ML workflow, the step that makes or breaks a model. We’ll be dealing with:
- Accuracy
- Consistency
- Uniformity, and
- Completeness
of the data, as described in this wonderful article. Read it later if you want to get the most in-depth knowledge about data cleaning techniques.
Likewise, explore the test data as well. I call it df_test.
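If you haven't loaded it yet, it comes from the competition's test.csv (I'm assuming here that it sits next to train.csv in the same data folder):
df_test = pd.read_csv('data/test.csv')
df_test.head()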
Now go ahead and make a combined list of the train and test data to start our cleaning.
data = pd.concat(objs = [df, df_test], axis = 0).reset_index(drop = True)
Our target variable will be the Survived column – let's keep it aside.
target = ['Survived']
First, we check for null values in the columns of the training data.
df.isnull().sum()

Right away we can observe that three columns (PassengerId, Ticket and Cabin) seem quite unnecessary for modelling. The Cabin column in particular is quite sparsely populated; we could keep it and derive some kind of value from it, but for now, let's keep it simple.
data = data.drop(['PassengerId', 'Ticket', 'Cabin'], axis = 1)
Now we move on to other columns that have null values.
We substitute the median for Age and Fare, and the mode for Embarked (which turns out to be S).
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
Doing that, we now have no null values in our feature columns (Survived is still missing for the test rows of the combined data, but those are exactly the values we'll be predicting). Job well done!
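If you want to verify this yourself, a quick check does the trick:
data.isnull().sum()  # only Survived should still show missing values, and only for the test rows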
Feature engineering
Now, let’s create some new features for our data.
We create a column called 'FamilySize', which is SibSp (siblings/spouses aboard) plus Parch (parents/children aboard) plus one for the passenger themselves.
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1  # +1 counts the passenger too
We also want a new column called 'IsAlone', which flags whether the passenger was travelling alone aboard the Titanic.
data['IsAlone'] = 1  # assume alone by default
data.loc[data['FamilySize'] > 1, 'IsAlone'] = 0  # travelling with at least one relative, so not alone
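If you'd like a quick sanity check of how the passengers split between the two groups:
data['IsAlone'].value_counts()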
Finally, we add one more thing: the title of each passenger, as a separate column.
data['Title'] = data['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
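To see what that chained split actually does, here it is on a single name from the training data (the first row, 'Braund, Mr. Owen Harris'):
name = "Braund, Mr. Owen Harris"
name.split(", ")[1].split(".")[0]  # -> 'Mr'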
Now, let’s see how many unique titles were created.
data['Title'].value_counts()

That is quite a lot! We don’t want so many titles.
So let's go ahead and lump all titles with fewer than 10 passengers into a separate 'Other' category.
title_names = (data.Title.value_counts() < 10)
data['Title'] = data['Title'].apply(lambda x: 'Other' if title_names.loc[x] else x)
data.drop('Name', axis = 1, inplace = True)  # Name has served its purpose, so we drop it
Now we look at the modified column.
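A quick value_counts on the column shows the result:
data['Title'].value_counts()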

This looks way better.
Alright, let's finally transform our two continuous columns, Age and Fare, into quantile bins (five bins for Age, four for Fare). Learn more about this function here.
data['AgeBin'] = pd.qcut(data['Age'].astype(int), 5)
data['FareBin'] = pd.qcut(data['Fare'], 4)
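If qcut is new to you, here is a tiny toy illustration (made-up numbers, not from the dataset): it cuts the values into bins so that each bin holds roughly the same number of rows.
pd.qcut(pd.Series([1, 2, 3, 4, 5, 6, 7, 8]), 4)  # four bins, each ends up with two of the eight values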
This makes these two columns categorical, which is exactly what we want. Now, let’s take another look at our data.
data.head()

Label Encoding our data
All our categorical columns can now be encoded into 0, 1, 2… etc. labels via the convenient function provided by sklearn.
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data['Sex'] = label.fit_transform(data['Sex'])
data['Embarked'] = label.fit_transform(data['Embarked'])
data['Title'] = label.fit_transform(data['Title'])
data['Age'] = label.fit_transform(data['AgeBin'])
data['Fare'] = label.fit_transform(data['FareBin'])
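In case you're wondering how LabelEncoder picks the numbers: it assigns integer labels in the sorted order of the unique values. A tiny illustration with made-up values:
demo = LabelEncoder()
demo.fit_transform(['S', 'C', 'Q', 'S'])  # array([2, 0, 1, 2])
demo.classes_                             # array(['C', 'Q', 'S'], ...)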
Alright! Now only two more steps remain. First, we drop the unnecessary columns.
data.drop(['FareBin', 'AgeBin'], axis = 1, inplace = True)
and lastly, we one-hot encode our non-label columns with pandas' get_dummies function.
columns_train = [col for col in data.columns.tolist() if col not in target]
data = pd.get_dummies(data, columns = columns_train)
Let’s take another look at our data now, shall we?
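A plain data.head() will do:
data.head()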

Awesome! Now we can begin our modelling!
Making an SVM model
First, we split the combined data back into the training part and the test (submission) part.
train_len = len(df)
train = data[:train_len].copy()
# the part we will predict on for the submission
test = data[train_len:].copy()
test.drop(labels = ["Survived"], axis = 1, inplace = True)
Now our training data will have the shape:
train.shape
Output:
(891, 49)
We also convert the label column of our train data back to integers (concatenating with the test set, where Survived is missing, turned it into floats).
train["Survived"] = train["Survived"].astype(int)
and finally, we separate our label column from other columns.
columns_train = [col for col in data.columns.tolist() if col not in target]
label = train['Survived']
train = train[columns_train]
train.shape
Output:
(891, 48)
With sklearn's splitting function, we split our train data into train and validation sets with an 80–20 split.
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(train, label, test_size = 0.20, random_state = 13)
Let’s take another look at the data shape again.
X_train.shape, y_train.shape, X_test.shape, y_test.shape
Output:
((712, 48), (712,), (179, 48), (179,))
Perfect! We are ready to train our SVM model.
Training the model
We import our support vector machine model:
from sklearn import svm
Next, we construct a simple classifier from it and fit it on our training data and labels:
clf = svm.SVC(kernel='linear') # Linear Kernel
clf.fit(X_train, y_train)
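If you're curious about what the fitted classifier looks like, you can peek at how many support vectors it ended up keeping for each class:
clf.n_support_  # number of support vectors per class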
Good! We’re very near to making our final submission predictions. Great job so far!
Now, let’s validate our model on the validation set.
from sklearn import metrics

y_pred = clf.predict(X_test)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.8379888268156425
This looks quite nice for a simple linear SVM model.
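Accuracy on its own can hide how the model treats each class, so as an optional extra you can also print a per-class breakdown (this isn't required for the submission):
print(metrics.classification_report(y_test, y_pred))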
Making the predictions for submission
We are finally here: the moment when we generate our first submission predictions and, with them, the file we will upload to the competition website.
Let’s use our trained classifier model to predict on the test(submission) set of the data.
test_Survived = pd.Series(clf.predict(test), name="Survived")
Finally, we make a dataframe that pairs each PassengerId from the test set with our prediction.
ID_column = df_test["PassengerId"]
results = pd.concat([ID_column, test_Survived], axis=1)
We check the shape of our final prediction output dataframe:
results.shape
Output:
(418, 2)
Awesome! This is exactly what we need.
The last step is to make a csv file from this dataframe:
results.to_csv("svm_linear.csv",index = False)
And we are done!
Making the submission
Go to the competition's website and look for the submission page (shown below) to upload your csv file.

All the code from this article is available in my repo here. Though if you've followed along so far, you will already have a workable codebase and a submission file of your own.
The README file also covers extras like setting up your virtual environment, so make sure to check it out if you want.
Learning Data science alone can be hard. Follow me and let’s make it fun together. 😁
Connect with me on Twitter.
Also, check out another article of mine that you might be interested in: