How to begin your own data science journey!

Hey there,

Not long ago, businessmen used to go to astrologers to predict how their next financial year would turn out. Though such predictions were baseless and uncertain, people made decisions based either on them or on advice from someone experienced in the field. We have now advanced to a point where we expect decisions to be backed by facts and numbers.

We live in a world with an abundance of data. When you go to Domino’s and order a pizza, the first thing they ask for is your mobile number, and through that number they pull up everything from your address to your past orders. But is the data limited to that? Or can we do something more with it? Here’s where data scientists come into the picture.

Now, let’s talk about the tools used for analyzing data.

  1. SAS: short for Statistical Analysis System, used for advanced analytics, data management and business intelligence. It is licensed software that was first developed at North Carolina State University between 1966 and 1976 and has since been developed by SAS Institute. SAS is still widely used, including at many Fortune 500 companies.
  2. R: an open source language for analysis and statistical modelling. Many open source libraries are being created for R. It is predominantly used through a command line interface.
  3. Python: my personal favorite. Python has been nothing short of revolutionary, and it is what we are going to use in this article. It’s a high-level programming language created by Guido van Rossum. It’s open source, which is why new libraries appear for it all the time. In fact, Python is a great choice if you want to make a career in machine learning and artificial intelligence.

Today we are going to look into Python for data science.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline

Here we import our three basic dependencies. The large majority of your projects will need these three imports. So what are these dependencies?

  1. NumPy: a third-party Python library (not part of the standard library) for numerical work such as array operations, matrix multiplication, type conversion, etc.
  2. Pandas: the most important library here. It is used to import datasets and create data frames, which can then be used for analysis, prediction or whatever you want to do!
  3. Matplotlib: a tool used for data visualization and representation.
  4. %matplotlib inline: since I’m going to use a Jupyter (IPython) notebook, I want my output (graphs) to appear inside the notebook.

    train = pd.read_csv('train.csv')

So we are now ready. We import the data using the command above. pd is short for pandas, as set in the import (import pandas as pd). read_csv is a function in the pandas library. train.csv is a file in the notebook’s working directory (in my case, the Anaconda directory). We have now loaded the file into Python, created a data frame out of it, and are ready to work with it!
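As a rough illustration of what read_csv does, here is a minimal sketch that parses a tiny CSV held in memory instead of a file on disk. The column names and values are made up for the example and have nothing to do with train.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file such as train.csv.
csv_text = "id,name,score\n1,a,3.5\n2,b,4.0\n"

# read_csv accepts a file path or any file-like object; the first line
# of the CSV becomes the column names of the data frame.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo)
print(list(demo.columns))
```

With a real file you would pass its path, exactly as in the line above the explanation.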

train.head()

What is a head? The top part of the human body, isn’t it? Similarly, pandas uses the head function to give us a look at the top part of the data frame. By default, head() returns the first 5 rows of the data frame.

Similarly, head(20) returns the first 20 rows. Interesting, isn’t it?

We can also use tail() to see the last 5 rows (by default) of the data frame.
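Here is a small sketch of head() and tail() on a made-up data frame of 30 rows. The frame and its column names are assumptions for illustration only; the same calls work on train.

```python
import pandas as pd

# Hypothetical stand-in for the article's data: 30 rows, two made-up columns.
demo = pd.DataFrame({"id": range(1, 31), "value": range(100, 130)})

first_five = demo.head()        # first 5 rows by default
first_twenty = demo.head(20)    # first 20 rows
last_five = demo.tail()         # last 5 rows by default

print(len(first_five), len(first_twenty), len(last_five))
```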

The output of head() would be:

output for the head() command

So now we’ve got the first 5 rows of the data frame, with all of its columns.

The most important thing is to first know what dataset you are dealing with. Questions about its size, shape and description are important and become really useful as we progress further. So what are the key things we need to know about the dataset? Since some datasets are huge and would be a real pain to work with, we need to find the useful information and eliminate the unwanted information. That sounds easy, but identifying which is which can be difficult. It’s good practice to start by looking at the shape of the dataset.

shape and size of the data

Here we can see the data has 891 rows and 12 columns, for a total size of 10692 values (891 × 12).

Let’s see some basic statistics about the data we have right now.
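A minimal sketch of how these numbers are read off a data frame, using a small made-up frame rather than the real data. shape is a (rows, columns) tuple and size is the total number of cells.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 4 columns.
demo = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "d": [10, 11, 12],
})

rows, cols = demo.shape   # (number of rows, number of columns)
total = demo.size         # rows * columns

print(rows, cols, total)
```

On the article's data the same attributes would be train.shape and train.size.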

Basic statistics about the data

Here we see statistics such as count, mean, percentiles and standard deviation. These are important when we work with financial or performance-related data.

Moving on, we are going to do some data visualization, which is one of the most important skills a data scientist must have. As discussed when we imported our dependencies, we are using matplotlib. There are various other libraries you can find with a quick Google search, but matplotlib serves our purpose here.

It is very important for a data scientist to know which representation to use. We are first going to talk about that and then move into the code.
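Statistics like these are typically produced with the describe() method in pandas. The sketch below assumes that is what was used, and runs it on a tiny made-up frame whose column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data, for illustration only.
demo = pd.DataFrame({
    "age": [22, 38, 26, 35, 35],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# describe() summarizes every numeric column: count, mean, standard
# deviation, minimum, quartiles (25%, 50%, 75%) and maximum.
stats = demo.describe()

print(stats)
```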

Types of representation

When we visualize data, we must keep a few things in mind:

  1. How many variables are to be shown on a single chart?
  2. Is there a single item for each data point, or are there several?
  3. Are we displaying the data over a period of time, or are we grouping it?

These factors influence the choice of chart.

Which chart ?

The image above helps us map which chart type to use and when.

But it’s highly recommended to learn data visualization and master it, since many companies want their data to tell a story.

Visualization tools such as Tableau and Power BI can tell the story of an entire dataset through a dashboard, which is very useful.

Now we look at how to visualize data in Python using matplotlib.

Age vs number of passengers (Age distribution of passengers)

We already imported matplotlib as plt. Here we first define figsize, which means figure size. If we don’t define figsize, a default size is used. Then we use our data and plot the age.

We can set labels: xlabel is the x-axis label, ylabel is the y-axis label, and title defines the title of the graph.
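The steps just described can be sketched as follows. This is one possible version, not necessarily the exact code behind the graph: the ages are made up, and the figure size and number of bins are arbitrary choices. In a notebook with %matplotlib inline the figure appears under the cell; in a plain script you would add plt.show() at the end.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages standing in for the age column of the data frame.
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4], name="Age")

plt.figure(figsize=(10, 6))          # width and height in inches

# A histogram counts how many passengers fall into each age range.
# dropna() removes missing ages, which a histogram cannot place.
counts, bin_edges, _ = plt.hist(ages.dropna(), bins=8)

plt.xlabel("Age")                    # x-axis label
plt.ylabel("Number of passengers")   # y-axis label
plt.title("Age distribution of passengers")
```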

After seeing the graph, we can infer a few things from the data:

  1. Young people made up most of those on board.
  2. There were very few old people.
  3. The most common ages were in the early twenties.
  4. The oldest person to travel was close to 80.

And we can infer a lot more by looking at the graph.

Similarly, we can make many other graphs from the dataset, such as this one:

Survival of people by age

The dataset tells us whether a person survived or not using binary values (1 or 0). Interesting, isn’t it? This is how we wrangle data for our needs, which we will do later on to predict more things using a statistical model.

Let’s move on and train our computer to predict whether a passenger survived, depending on the data available.
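One way to draw survival by age is to split the ages on the binary column and stack the two histograms. This is only a sketch of that idea: the data is made up, and the column names Age and Survived are assumptions about how the data frame is laid out.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical passengers: Survived is 1 for yes and 0 for no.
demo = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# Split the ages into two groups using the binary column.
survived_ages = demo.loc[demo["Survived"] == 1, "Age"].to_numpy()
died_ages = demo.loc[demo["Survived"] == 0, "Age"].to_numpy()

plt.figure(figsize=(10, 6))
plt.hist(
    [survived_ages, died_ages],
    bins=6,
    stacked=True,
    label=["Survived (1)", "Did not survive (0)"],
)
plt.xlabel("Age")
plt.ylabel("Number of passengers")
plt.title("Survival of people by age")
plt.legend()
```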

Machine Learning algorithms

Till now we have seen how to import data, visualize it and draw inferences from it. Now we’ll see which algorithm to use and when.

There are two main types of machine learning algorithms:

  1. Supervised learning: we use it when we have labelled data. The machine looks for patterns that connect the data points to their labels.
  2. Unsupervised learning: we use it when we don’t have labelled data. The machine organizes the data into clusters and compares them to find relationships.

Regression is an example of supervised learning, and clustering (for example k-means) is an example of unsupervised learning. Naive Bayes, like regression, is a supervised method because it learns from labelled data.
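To make the difference concrete, here is a small sketch on made-up one-dimensional data. The supervised model is given the labels y, while the clustering model (k-means, chosen here only as an illustration of unsupervised learning) sees the points alone. All values and parameter choices are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two well-separated groups of points.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])   # labels, used only by the supervised model

# Supervised: learn the mapping from points to labels, then predict.
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(np.array([[1.1], [5.1]]))

# Unsupervised: no labels given, the algorithm groups similar points.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print(supervised_pred, cluster_ids)
```

Note that the cluster numbers are arbitrary: k-means may call the first group 0 or 1, since it never saw the labels.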

This link would help you understand which algorithm is the best fit and which one is to be used.

For this data we are going to use logistic regression. How do we do that?

Logistic regression helps us predict an outcome in the form of true or false. Basically, a set of data is given to the machine, the machine computes a probability using the logistic regression formula, and the answer is returned in binary format.
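The formula behind this is the logistic (sigmoid) function, which squeezes a weighted sum of the inputs into a probability between 0 and 1. The sketch below shows the calculation by hand. The weights, bias and input values are invented for illustration, and the 0.5 cut-off is the usual default rather than something fixed by the method.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical learned parameters and one input row with two features.
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([1.0, 2.0])

# Weighted sum of the inputs, then squashed into a probability.
probability = sigmoid(weights @ x + bias)

# Turn the probability into a binary answer with a 0.5 threshold.
prediction = int(probability >= 0.5)

print(probability, prediction)
```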

According to Wikipedia, in statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable (DV) is categorical. This article covers the case of a binary dependent variable, that is, where it can take only two values, “0” and “1”, which represent outcomes such as pass/fail, win/lose, alive/dead or healthy/sick. Cases where the dependent variable has more than two outcome categories may be analysed in multinomial logistic regression, or, if the multiple categories are ordered, in ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model.

So how does logistic regression help us here?

We already have the Survived column in binary format, so that won’t be a problem. But we need to change the gender column to 1s and 0s so that we can use gender as one of the variables in the prediction.

We need to import sklearn (scikit-learn), a third-party Python library of machine learning and statistical tools that is installed separately from Python itself. Sklearn covers most common modelling needs and is a very powerful tool to work with.

How do we do that?

from sklearn.linear_model import LogisticRegression 

We import LogisticRegression, which is inside the sklearn library.

A prerequisite for logistic regression is to have two sets of data:

  1. Train: this data is used to train the computer.
  2. Test: this data is usually smaller and is used to check the accuracy of the machine.

Converting genders into binary format

After converting them into binary format, we can use the LogisticRegression class to predict our outcome.
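The conversion itself might look like the sketch below. The column name Sex, the values male and female, and the choice of which one becomes 1 are all assumptions about the data; adjust them to match your own columns.

```python
import pandas as pd

# Hypothetical gender column stored as text.
demo = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# map() replaces each value using the dictionary: male -> 0, female -> 1.
# Any value missing from the dictionary would become NaN.
demo["Sex"] = demo["Sex"].map({"male": 0, "female": 1})

print(demo)
```

The same mapping has to be applied to both the train and the test data so that the model sees the same encoding in each.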

First we set the Survived column from train as the output for the logistic regression model to learn.

Since we have separate train and test datasets, we can use them directly without the train/test split function.

Prediction using logistic regression

Let’s go step by step on this one:

  1. We mark the Survived column as our label (output) and use data_train (a variable) as our training input, without the Survived column.
  2. Then we import sklearn and define model (a variable) as a LogisticRegression instance.
  3. We then fit our model. This is where the computer tries to recognize the pattern, which enables it to give predictions when similar data is passed to it.
  4. We have a separate dataset called test, where Survived doesn’t exist. We use the predict function to predict the outcome based on what the model learned from the training dataset.
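The four steps above can be sketched as follows. This is a minimal version under stated assumptions, not the exact code from the screenshot: the tiny train and test frames are made up, the feature columns Sex and Age are hypothetical, and max_iter is raised only to give the solver room to converge on unscaled data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data, already converted to numbers.
train = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 0, 1, 0, 1],
    "Age": [22, 38, 26, 35, 54, 27, 40, 14],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})
# Hypothetical test data: same feature columns, no Survived column.
test = pd.DataFrame({"Sex": [1, 0], "Age": [30, 45]})

# Step 1: the label is Survived; the input is everything else.
label = train["Survived"]
data_train = train.drop(columns="Survived")

# Step 2: create the model.
model = LogisticRegression(max_iter=1000)

# Step 3: fit, so the model learns the pattern in the training data.
model.fit(data_train, label)

# Step 4: predict survival (0 or 1) for the unseen test rows.
predictions = model.predict(test)

print(predictions)
```

The test frame must have the same feature columns, in the same order, as data_train.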

This is how the computer predicts from learning. Next time we will look at other models and also see what the accuracy score and the confusion matrix are.

Hope you enjoyed it and got something out of it. Please share if you did, and see you next time.