
Making Your First Kaggle Submission

An easy-to-understand guide to getting started with competitions, building a model, and successfully making your first submission.

Photo by Dayne Topkin on Unsplash

In the world of data science, using Kaggle is almost a necessity: you use it to find datasets for your projects, learn from notebooks generously shared by people who want to see you build good machine learning models, discover new ways of approaching complex problems, and much more.

One of the best ways to try out your skills in the real world is through the competitions hosted on the website. The Titanic competition is the most wholesome, beginner-friendly way to get started and get a feel for what to expect and how best to approach a problem.

In this article, I will walk you through building your first machine learning model and launching your own ship to sail in the sea of these competitions.

Let’s go!

Understanding the data

First – open up the competition page. You will need to refer to it every now and then.

For this particular task, our problem statement and the end goal are clearly defined on the competition page itself:

We need to develop an ML algorithm to predict the survival outcome of passengers on the Titanic.

The outcome is measured as 0 (did not survive) and 1 (survived). This is a dead giveaway that we have a binary classification problem at hand.

Well, fire up your Jupyter notebook, and let’s see what the data looks like!

import pandas as pd

df = pd.read_csv('data/train.csv')
df.head()
train data as a dataframe

At first glance, there is a mix of categorical and continuous features. Let’s take a look at the data types for the columns:

df.info()
info about train data

I won’t go into the details of what each of these features represents about a Titanic passenger – I’ll assume you’ve read about them on the Kaggle website by now (which you should if you haven’t).

Data cleaning

This is the backbone of our entire ML workflow, the step that makes or breaks a model. We’ll be dealing with:

  1. Accuracy
  2. Consistency
  3. Uniformity, and
  4. Completeness

of the data, as described in this wonderful article. Read it later if you want a more in-depth look at data cleaning techniques.

Likewise, load and explore the test data as well. I call it df_test.
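A minimal sketch of that step, assuming the competition’s test.csv lives in the same data/ folder as the train file:

df_test = pd.read_csv('data/test.csv')
df_test.head()
df_test.info()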

Now go ahead and concatenate the train and test data into a single dataframe so we can clean both at once.

data = pd.concat(objs = [df, df_test], axis = 0).reset_index(drop = True)

Our target variable will be the Survived column – let’s keep it aside.

target = ['Survived']

First, we check for null values in the columns of the training data.

df.isnull().sum()
sum of null values in cols

Right away, we can see that the PassengerId and Ticket columns seem unnecessary for modelling, and the Cabin column is quite sparsely populated. We could keep Cabin and derive some kind of value from it, but for now, let’s keep it simple and drop all three.

data = data.drop(['PassengerId', 'Ticket', 'Cabin'], axis = 1)

Now we move on to other columns that have null values.

We substitute the median for Age and Fare, and the mode for Embarked (which turns out to be S).

data['Age'] = data['Age'].fillna(data['Age'].median())
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

With that, our feature columns have no null values left. Job well done!
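If you want a quick sanity check, re-run the null count on the combined dataframe – the only column that should still show missing values is Survived, since the 418 test rows have no labels by design.

data.isnull().sum()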

Feature engineering

Now, let’s create some new features for our data.

We create a column called ‘FamilySize‘, which is the sum of the SibSp (siblings/spouses aboard) and Parch (parents/children aboard) columns.

data['FamilySize'] = data.SibSp + data.Parch

Also, we want a new column called ‘IsAlone‘, which indicates whether the passenger was travelling alone aboard the Titanic.

data['IsAlone'] = 1 # assume alone by default
data.loc[data['FamilySize'] > 0, 'IsAlone'] = 0 # any relative aboard means not alone

Finally, we add one more feature – the passengers’ titles as a separate column.

Let’s create a new column first.

data['Title'] = data['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
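To see what this chain does, walk through it on the first name in the training data – the pandas version simply applies the same two splits column-wide with expand=True:

name = "Braund, Mr. Owen Harris"
step1 = name.split(", ")[1] # 'Mr. Owen Harris'
title = step1.split(".")[0] # 'Mr'
print(title)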

Now, let’s see how many unique titles were created.

data['Title'].value_counts()
all titles of the passengers

That is quite a lot! We don’t want so many titles.

So let’s go ahead and group all titles held by fewer than 10 passengers into a single ‘Other‘ category.

title_names = (data.Title.value_counts() < 10) # boolean Series: True for rare titles
data['Title'] = data['Title'].apply(lambda x: 'Other' if title_names.loc[x] else x)
data.drop('Name', axis = 1, inplace = True) # Name is no longer needed now that Title is extracted

Now we look at the modified column.
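Running the value counts again shows the consolidated set of titles:

data['Title'].value_counts()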

This looks way better.

Alright, let’s finally transform our two continuous columns, Age and Fare, into quantile bins – five bins for Age and four (quartiles) for Fare. Learn more about the qcut function here.

data['AgeBin'] = pd.qcut(data['Age'].astype(int), 5)
data['FareBin'] = pd.qcut(data['Fare'], 4)

This makes these two columns categorical, which is exactly what we want. Now, let’s take another look at our data.

data.head()
engineered data

Label Encoding our data

All our categorical columns can now be encoded into integer labels (0, 1, 2, and so on) via the convenient LabelEncoder class provided by sklearn.

from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
data['Sex'] = label.fit_transform(data['Sex'])
data['Embarked'] = label.fit_transform(data['Embarked'])
data['Title'] = label.fit_transform(data['Title'])
# replace the continuous columns with their encoded bins
data['Age'] = label.fit_transform(data['AgeBin'])
data['Fare'] = label.fit_transform(data['FareBin'])

Alright! Only two more steps remain. First, we drop the now-redundant bin columns.

data.drop(['FareBin', 'AgeBin'], axis = 1, inplace = True)

And lastly, we one-hot encode our non-label columns with pandas’ get_dummies function.

columns_train = [col for col in data.columns.tolist() if col not in target]
data = pd.get_dummies(data, columns = columns_train)

Let’s take another look at our data now, shall we?
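Another data.head() shows the fully encoded dataframe:

data.head()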

one-hot-encoded data

Awesome! Now we can begin our modelling!

Making an SVM model

First, split the combined data back into its train and test (submission) portions.

train_len = len(df)
train = data[:train_len].copy()
# the portion we will predict on for submission
test = data[train_len:].copy()
test.drop(labels = ["Survived"], axis = 1, inplace = True)

Now our training data will have the shape:

train.shape
Output:
(891, 49)

We now cast the label column back to integer – concatenating with the test set (where Survived is empty) turned it into a float column.

train["Survived"] = train["Survived"].astype(int)

and finally, we separate our label column from the feature columns.

columns_train = [col for col in data.columns.tolist() if col not in target]
label = train['Survived']
train = train[columns_train]
train.shape
Output:
(891, 48)

With sklearn’s splitting function, we split our train data into train and validation sets with an 80–20 split.

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(train, label, test_size = 0.20, random_state = 13)

Let’s take another look at the data shape again.

X_train.shape, y_train.shape, X_test.shape, y_test.shape
Output:
((712, 48), (712,), (179, 48), (179,))

Perfect! We are ready to train our SVM model.

Training the model

We import our support vector machine model:

from sklearn import svm

Next, we construct a simple classifier from it and fit it on our training data and labels:

clf = svm.SVC(kernel='linear') # Linear Kernel
clf.fit(X_train, y_train)

Good! We’re very near to making our final submission predictions. Great job so far!

Now, let’s validate our model on the validation set.

from sklearn import metrics

y_pred = clf.predict(X_test)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy:  0.8379888268156425

This looks quite nice for a simple linear SVM model.
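If you feel like experimenting beyond this baseline (not something the original workflow requires), a quick sketch is to swap kernels and compare validation accuracies:

for kernel in ['linear', 'rbf', 'poly']:
    model = svm.SVC(kernel = kernel)
    model.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    print(kernel, acc)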

Making the predictions for submission

We are finally here – the moment when we generate our first submission predictions and, with them, the file we will upload to the competition website.

Let’s use our trained classifier to predict on the test (submission) portion of the data.

test_Survived = pd.Series(clf.predict(test), name="Survived")

Next, we pair each PassengerId from the test file with its corresponding prediction to build the submission dataframe:

ID_column = df_test["PassengerId"]
results = pd.concat([ID_column, test_Survived], axis=1)

We check the shape of our final prediction output dataframe:

results.shape
Output:
(418, 2)

Awesome! This is exactly what we need.

The last step is to make a csv file from this dataframe:

results.to_csv("svm_linear.csv",index = False)

And we are done!

Making the submission

Go to the competition’s website and look for the page shown below to upload your csv file.

making your submission!
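Alternatively, if you have the official Kaggle CLI installed (pip install kaggle) and an API token configured, you can submit straight from a notebook cell – a hypothetical one-liner (adjust the message as you like):

!kaggle competitions submit -c titanic -f svm_linear.csv -m "first submission: linear SVM"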

All the code from this article is available in my repo here. That said, if you’ve followed along so far, you already have a workable codebase and a submission file of your own.

The README file also covers extras like setting up your virtual environment, so make sure to check it out if you want.

yashprakash13/data-another-day


Learning data science alone can be hard. Follow me and let’s make it fun together. 😁

Connect with me on Twitter.

Also, check out another article of mine that you might be interested in:

How to Build an End-to-End Deep Learning Portfolio Project

