Data Science and Python go hand in hand. There are many places and ways to learn and use Python, and one wonderful place is a website called Kaggle.
If you are looking for free access to courses, notebooks, datasets, a large community, and competitions, then look no further than Kaggle. In this exercise we will use a Kaggle Notebook to explore a dataset of trending YouTube videos and draw some basic insights from it.
In this notebook I will walk through a basic example of building and running an OLS Linear Regression model to explore the correlations between YouTube video views and features such as likes, dislikes, and comment counts.
Sources
I will be using the Trending YouTube Video Statistics dataset, collected via the YouTube API and published on Kaggle by Mitchell J, here:
The source code can be found here:
The dataset can be found here:
The license for the dataset can be found here.
My Content
My GitHub repository can be found here:
My Kaggle Notebook can be found here:
References
I also recommend some other articles I authored if you are new to Python.
A brief history of the Python programming language
A Complete Beginners Reference Guide to Python
The best IDEs and Text Editors for Python
An Overview of The Anaconda Distribution
An Overview of The PEP 8 Style Guide
Table of Contents
What is OLS Linear Regression?
What is the sklearn Module?
Imports & Loading the Dataset
Preparing the Data
Building and Running the OLS Linear Regression Model
Running the Code in a Text Editor or IDE
Conclusion
What is OLS Linear Regression?
In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function.
Source & Acknowledgement:
Wikipedia contributors. (2020, October 4). Ordinary least squares. In Wikipedia, The Free Encyclopedia. Retrieved 16:21, November 9, 2020, from https://en.wikipedia.org/w/index.php?title=Ordinary_least_squares&oldid=981750893
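In symbols (standard textbook notation, not taken from the excerpt above), OLS picks the coefficient vector that minimizes the sum of squared residuals:
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2
Here each y_i is an observed value of the dependent variable (views, in our case) and x_i holds the explanatory variables (likes, dislikes, or comment count).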
What is the sklearn Module?
The sklearn module for Python is used for predictive data analysis. You can learn more from its website and docs.
You can also check out the pandas website and docs here:
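Before touching the YouTube data, here is a minimal sketch of the fit/predict workflow that sklearn estimators share. The tiny arrays below are made-up toy values, used only to illustrate the pattern we follow later in the notebook.
# minimal sketch of sklearn's fit/predict pattern (toy values for illustration only)
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # one explanatory variable
y_toy = np.array([2.1, 3.9, 6.2, 8.1])          # observed dependent variable

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)                  # estimate the coefficients by least squares
print(toy_model.coef_, toy_model.intercept_)
print(toy_model.predict([[5.0]]))            # predict for a new observation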
Imports & Loading the Dataset
The first step is to make the necessary imports and load the dataset.
# import modules
import matplotlib.pyplot as plt
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# load data
data = pd.read_csv('../input/youtube-new/USvideos.csv')

Preparing the Data
Next, preprocess the data to get it ready for the OLS Linear Regression model.
# break data down for analysis
df = data[['title', 'views', 'likes', 'dislikes', 'comment_count']]
views = data[['title', 'views']]
likes = data['likes']
dislikes = data['dislikes']
comment_count = data['comment_count']
# create feature list
train_list = [likes, dislikes, comment_count]
# preview the selected data
df.head()

Building and Running the OLS Linear Regression Model
Next, create and run the model from the sklearn module. Make sure to check the Mean Squared Error and the R² score for each feature.
# create scaler variable
scaler = StandardScaler()
# get feature titles ready
labels = ['likes', 'dislikes', 'comment_count']
# get y ready and preprocessed
y = views['views']
y_scaled = y.values.reshape(-1, 1)
y_scaled = scaler.fit_transform(y_scaled)
# get x ready and preprocessed, one feature at a time
for i, x in enumerate(train_list):
    x_scaled = x.values.reshape(-1, 1)
    x_scaled = scaler.fit_transform(x_scaled)
    # split data for fitting and predicting with the model
    X_train, X_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size=0.33, random_state=42)
    # create model
    reg = linear_model.LinearRegression()
    # fit model
    reg.fit(X_train, y_train)
    # make prediction
    y_pred = reg.predict(X_test)
    # check the mean squared error
    mse = mean_squared_error(y_test, y_pred)
    # check the score function
    r2s = r2_score(y_test, y_pred)
    # print feature label
    print(labels[i])
    # print mse
    print(f'Mean squared error: {mse}')
    # 1 equals perfect prediction
    print(f'Coefficient of determination: {r2s}')
    # plot the visuals for a sanity check
    plt.figure()
    plt.plot(x, y, label=labels[i])
    plt.title(f'{labels[i]} effect on YouTube views')
    plt.xlabel(labels[i])
    plt.ylabel('views')
    plt.legend()
    plt.show()

Both the visual output and the model's metrics point to a stronger correlation between likes and video views than between views and either dislikes or comment counts.
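If you want to quantify that observation directly, one extra check (not part of the original notebook) is to print the pairwise Pearson correlations between the numeric columns with pandas:
# extra sanity check: pairwise Pearson correlations between the numeric columns
print(df[['views', 'likes', 'dislikes', 'comment_count']].corr())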
Running the Code in a Text Editor or IDE
If you want to run this in a text editor or IDE like PyCharm, you can use the code in this Gist from my GitHub repository.
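The main adjustment you will likely need when running locally is the dataset path, since the /kaggle/input directory only exists on Kaggle. The filename below is an assumption; point it at wherever you downloaded the CSV.
# when running locally, load the CSV from wherever you saved it (path is an assumption)
data = pd.read_csv('USvideos.csv')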
Conclusion
While OLS Linear Regression can be good for identifying basic correlations and insights in a dataset, it can also be limited, inaccurate, or poorly suited to certain types of problems, so make sure you choose the right model for the task. This is a simple example to explore the basics of Data Science in Python, and this article also serves as a brief introduction to Kaggle Notebooks and Datasets, as well as sklearn and OLS Linear Regression. I hope this helped anyone beginning with Data Science in Python. Thank you, and happy coding!