
Getting Started with Data Science in Python

Performing Simple OLS Linear Regression on a YouTube Trends Dataset

Photo by Yash Jain on Unsplash

Data Science and Python go hand in hand. There are many places and ways for you to learn and use Python. One wonderful place is a website called Kaggle.

Kaggle: Your Machine Learning and Data Science Community

If you are looking to have access to courses, notebooks, datasets, a large community, and competitions for free, then look no further than Kaggle. In this exercise we will be using a Kaggle Notebook to explore and make small insights into a YouTube Trends dataset.

In this notebook I will perform a basic example of building and running an OLS Linear Regression model to explore some correlations between YouTube video views and feature data including likes, dislikes, and comment counts.


Sources

I will be using the Trending YouTube Video Statistics dataset obtained through the YouTube API by Mitchell J on Kaggle here:

Mitchell J | Kaggle

The source code can be found here:

mitchelljy/Trending-YouTube-Scraper

The dataset can be found here:

Trending YouTube Video Statistics

The license for the dataset can be found here:

Creative Commons – CC0 1.0 Universal

My Content

My GitHub repository can be found here:

third-eye-cyborg/LinearRegressionOLS-YouTubeTrends

My Kaggle Notebook can be found here:

OLS Linear Regression | YouTube Trends


References

I also recommend some other articles I authored if you are new to Python.

A brief history of the Python programming language

Python – Basic Overview

A Complete Beginners Reference Guide to Python

The best IDEs and Text Editors for Python

An Overview of The Anaconda Distribution

An Overview of The PEP 8 Style Guide

Exploring Design Patterns in Python

Introduction to Pandas


Table of Contents

What is OLS Linear Regression?

What is the sklearn Module?

Imports & Loading the Dataset

Preparing the Data

Building and Running the OLS Linear Regression Model

Running code in Text Editor or IDE

Conclusion


What is OLS Linear Regression?

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function.

Source & Acknowledgement:

Ordinary least squares

Wikipedia contributors. (2020, October 4). Ordinary least squares. In Wikipedia, The Free Encyclopedia. Retrieved 16:21, November 9, 2020, from https://en.wikipedia.org/w/index.php?title=Ordinary_least_squares&oldid=981750893
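
To make the definition concrete, here is a minimal sketch that fits a straight line to a small made-up dataset by minimizing the sum of squared residuals with NumPy. The data values and variable names are illustrative assumptions and are not taken from the YouTube dataset.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2*x + 1 plus some noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones so the model has an intercept
X = np.column_stack([np.ones_like(x), x])

# np.linalg.lstsq returns the coefficients that minimize ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# Sum of squared residuals for the fitted line
residuals = y - X @ beta
ssr = float(np.sum(residuals ** 2))

print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSR={ssr:.4f}")
```

Any other choice of intercept and slope would give a larger sum of squared residuals on this data, which is what "least squares" refers to.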

What is the sklearn Module?

The sklearn module (the import name of the scikit-learn library) provides tools for predictive data analysis in Python. You can learn more from its website and docs.

scikit-learn

You can also check out the pandas website and docs here:

pandas

Imports & Loading the Dataset

The first step is to make the necessary imports and load the dataset.

# import modules
import matplotlib.pyplot as plt
import pandas as pd
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# load data
data = pd.read_csv('../input/youtube-new/USvideos.csv')
Kaggle Notebook Output by Author

Preparing the Data

Then preprocess the data and get it ready for the OLS Linear Regression model.

# break data down for analysis
df = data[['title', 'views', 'likes', 'dislikes', 'comment_count']]
views = data[['title', 'views']]
likes = data['likes']
dislikes = data['dislikes']
comment_count = data['comment_count']

# create feature list
train_list = [likes, dislikes, comment_count]

# preview the first rows of the relevant columns
df.head()
Kaggle Notebook Output by Author
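
Before modeling, it can be worth confirming that the numeric columns really are numeric and contain no missing values. The sketch below shows one way to do that. It uses a tiny in-memory CSV with made-up rows as a stand-in for USvideos.csv, and `check_numeric_columns` is a hypothetical helper, not part of pandas or of the original notebook.

```python
import io
import pandas as pd

# Tiny made-up stand-in for USvideos.csv, using the same column names
csv_text = """title,views,likes,dislikes,comment_count
Video A,1000,100,5,20
Video B,25000,3000,40,310
Video C,500,,2,7
"""
sample = pd.read_csv(io.StringIO(csv_text))

def check_numeric_columns(frame, columns):
    """Return, per column, whether it is numeric and how many values are missing."""
    report = {}
    for col in columns:
        report[col] = {
            "is_numeric": pd.api.types.is_numeric_dtype(frame[col]),
            "missing": int(frame[col].isna().sum()),
        }
    return report

numeric_cols = ["views", "likes", "dislikes", "comment_count"]
report = check_numeric_columns(sample, numeric_cols)
print(report)

# Drop rows with missing values in the columns the model will use
clean = sample.dropna(subset=numeric_cols)
print(len(sample), "rows before,", len(clean), "rows after")
```

On the real dataset, the same helper could be called on `data` with the same column list.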

Building and Running the OLS Linear Regression Model

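Next, create and run the model from the sklearn module. The loop below fits a separate single-feature model for each of likes, dislikes, and comment count. Make sure to check the mean squared error and the R² score for each one.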

# create scaler variable
scaler = StandardScaler()

# get feature titles ready
labels = ['likes', 'dislikes', 'comment_count']

# get y ready and preprocessed
y = views['views']
y_scaled = y.values.reshape(-1, 1)
y_scaled = scaler.fit_transform(y_scaled)
# get x ready and preprocessed
for i, x in enumerate(train_list):
    x_scaled = x.values.reshape(-1, 1)
    x_scaled = scaler.fit_transform(x_scaled)

    # split data for fitting and predicting with the model
    X_train, X_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size=0.33, random_state=42)

    # create model
    reg = linear_model.LinearRegression()

    # fit model
    reg.fit(X_train, y_train)

    # make prediction
    y_pred = reg.predict(X_test)

    # check the mean squared error
    mse = mean_squared_error(y_test, y_pred)

    # check the score function
    r2s = r2_score(y_test, y_pred)

    # print feature label
    print(labels[i])

    # print mse
    print(f'Mean squared error: {mse}')

    # 1 equals perfect prediction
    print(f'Coefficient of determination: {r2s}')

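    # plot the visuals for a sanity check (feature on x-axis, views on y-axis)
    plt.scatter(x, y, label=labels[i])
    plt.title(f'{labels[i]} effect on YouTube views')
    # xlabel/ylabel are functions; assigning to them would overwrite them
    plt.xlabel(labels[i])
    plt.ylabel('views')
    plt.legend()
    plt.show()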
Kaggle Notebook Output by Author
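
The loop above fits one single-feature model per column and scales the full dataset before splitting it. As an alternative sketch (not the notebook's code), a scikit-learn pipeline can fit the scaler on the training split only and use all three features in one model. The data here is synthetic, and the coefficients used to generate it are assumptions chosen only so that the example has a signal to recover.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the YouTube columns (values are made up)
rng = np.random.default_rng(42)
n = 500
features = pd.DataFrame({
    "likes": rng.integers(0, 10_000, size=n),
    "dislikes": rng.integers(0, 500, size=n),
    "comment_count": rng.integers(0, 2_000, size=n),
})
# Assumed relationship, used only to generate example data
views = (
    30 * features["likes"]
    + 50 * features["dislikes"]
    + 10 * features["comment_count"]
    + rng.normal(0, 5_000, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    features, views, test_size=0.33, random_state=42
)

# The pipeline fits the scaler on the training split only, then the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# R^2 computed by hand for comparison: 1 - SS_res / SS_tot
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse:.1f}")
print(f"R^2 (sklearn): {r2:.4f}  R^2 (manual): {r2_manual:.4f}")
```

The manual calculation is included to show what the R² score measures: the share of the variance in the test targets that the predictions account for.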

Both the plots and the model’s scores point to a stronger relationship between likes and video views than between dislikes or comment counts and views.
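
One quick way to cross-check a comparison like this is a Pearson correlation table. The sketch below uses synthetic columns (made up so that likes track views closely and dislikes only loosely); it illustrates the call and does not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd

# Synthetic columns (made up): likes track views closely, dislikes only loosely
rng = np.random.default_rng(0)
views = rng.uniform(1_000, 1_000_000, size=1_000)
toy = pd.DataFrame({
    "views": views,
    "likes": 0.03 * views + rng.normal(0, 2_000, size=1_000),
    "dislikes": 0.001 * views + rng.normal(0, 1_000, size=1_000),
})

# Pearson correlation of every other column with views, strongest first
corr_with_views = toy.corr()["views"].drop("views").sort_values(ascending=False)
print(corr_with_views)
```

With the notebook's data, the same call could be applied to `df[['views', 'likes', 'dislikes', 'comment_count']]`.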

Running code in Text Editor or IDE

If you want to run this in a text editor or an IDE like PyCharm, you can use the code in this Gist from my GitHub repository.
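
Outside Kaggle, the `../input/youtube-new/USvideos.csv` path will most likely not exist, so the CSV has to be downloaded and the path adjusted. The sketch below shows one way to handle that. `find_dataset` is a hypothetical helper, and `data/USvideos.csv` is an assumed local location rather than a path from the original project.

```python
from pathlib import Path

def find_dataset(candidates):
    """Return the first existing file path from candidates, or None if none exist."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Kaggle path from the notebook first, then an assumed local download location
csv_path = find_dataset([
    "../input/youtube-new/USvideos.csv",
    "data/USvideos.csv",  # hypothetical local path
])

if csv_path is None:
    print("USvideos.csv not found; download it from Kaggle and update the path.")
else:
    print(f"Loading {csv_path}")
    # data = pd.read_csv(csv_path)  # then continue as in the notebook
```

When running as a plain script rather than in a notebook, the plots will typically only appear once `plt.show()` is called.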


Conclusion

While OLS Linear Regression can be good for identifying basic correlations and insights in some data, it can be limited, inaccurate, or too restrictive for certain types of problems. Make sure you choose the right model for the task. This is a simple example to explore the basics of Data Science in Python. This article also serves as a brief introduction to Kaggle Notebooks and Datasets, as well as sklearn and OLS Linear Regression. I hope this helps anyone beginning with Data Science in Python. Thank you and happy coding!

