
Understanding Regression using COVID-19 Dataset – Detailed Analysis

Are counts of confirmed COVID cases dependent on testing? Let's check it out!

Every number tells a story (Image by Unsplash)

Hope you all are safe and healthy! COVID-19 has totally changed the course of 2020, shrinking the global economy at an alarming rate. In this blog, we will take a deep dive into Regression Analysis and see how we can use it to draw some mind-boggling insights from Chicago's COVID-19 dataset. Some intuition of both calculus and linear algebra will make your journey easier. Welcome to my first blog!

Concept

Linear Regression is not just a machine learning algorithm; it also plays a huge role in statistics. It belongs to the family of supervised learning, where each input is associated with a target label. The task of the model is to understand the pattern in the data and find the best-fit line through the (input, target) pairs.

To give an overview, ML models can be classified on the basis of the task performed and the nature of the output:

  • Regression: Output is a continuous variable.
  • Classification: Output is a categorical variable.
  • Clustering: No notion of output.

Regression & Classification fall under supervised learning, while Clustering falls under unsupervised learning.

Regression is a form of predictive modeling technique where we try to find a significant relationship between a dependent variable and one or more independent variables. There are various types of regression techniques: Linear, Logistic, Polynomial, Ridge, Lasso, Softmax.

Linear Regression (LR)

As the term itself suggests, the model functions along a straight line, where y is calculated as a linear combination of the input variables x. The Linear Regression hypothesis function can be formulated as:

Linear Regression hypothesis function (Image by Author)

Θ here stores the coefficients/weights of the input features x and has exactly the same dimensionality as x. Note that to add support for a constant (intercept) term in our model, we prefix the vector x with a 1.
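In case the formula image does not render for you, this is the standard form it shows (for n input features):

h_θ(x) = θ^T x = θ_0·1 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n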

With a single input variable, the method is called simple linear regression, while with multiple inputs/features it is called multiple linear regression. In both cases, our goal is to find the best-fitting line: the one that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output across all samples.

SSE & MSE Calculation (Image by Author)
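For reference, the two quantities in the figure are the standard definitions over m samples, with ŷ_i denoting the model's prediction for sample i:

SSE = Σ_{i=1..m} (y_i − ŷ_i)²
MSE = (1/m) · Σ_{i=1..m} (y_i − ŷ_i)²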

To proceed, we need to find the parameters/coefficients of that best-fit line. There are various techniques to arrive at the geometric equation for the line, such as least absolute deviations (minimizing the sum of absolute values of residuals) and the Theil–Sen estimator (which finds a line whose slope is the median of the slopes determined by pairs of sample points); however, statisticians typically use the Ordinary Least Squares (OLS) method. OLS is nothing but a method to minimize the distances between the line and the actual outputs. If you want to calculate the regression line by hand, it uses a slightly scary formula to find the slope θ1 and the intercept θ0.

Finding intercept and slope using OLS (Image by Author)
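If the image is unavailable, these are the closed-form OLS estimates for simple linear regression (x̄ and ȳ are the sample means):

θ_1 = Σ_{i=1..m} (x_i − x̄)(y_i − ȳ) / Σ_{i=1..m} (x_i − x̄)²
θ_0 = ȳ − θ_1·x̄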

There are 2 approaches/solutions that use the Least Squares method to implement a linear regression model:

  • Closed-form solution – Normal Equation
  • An optimization algorithm (Gradient Descent, Newton’s Method, etc.)

Normal Equation

It's important to understand that θ here can essentially make or break the model. Our aim is basically to look for the θ that achieves the lowest cost for the model.

The famous Cost Function (Image by Author)
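The scaling constant in front of the sum varies by convention (1/2m, 1/m, or none); the 1/2m version, which I will stick with below, is:

J(θ) = (1 / 2m) · Σ_{i=1..m} (h_θ(x_i) − y_i)²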

The choice of objective/cost function J(θ) can vary greatly depending on the problem at hand. In general, mean squared error works best for regression and cross-entropy works best for classification. So, coming back to the Normal Equation: it uses an analytical approach to find the parameters of the equation. It took some time for me to understand why things happen the way they do. Hopefully, this short proof will make the concept clear.

If we take the residual vector as e = Xθ − y, then the cost function/sum of squared residuals in vector form is:

Normal Equation Derivation (Image by Author)
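For readers who cannot see the image, the derivation expands the squared residual vector as:

J(θ) = eᵀe = (Xθ − y)ᵀ(Xθ − y) = θᵀXᵀXθ − 2θᵀXᵀy + yᵀy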

For someone new to linear algebra, note that we cannot simply square Xθ − y element-wise: the "square" of a vector is not obtained by squaring each of its values. To get the scalar sum of squared residuals, we multiply the vector by its transpose.

Now, to find the θ that minimizes the sum of squared residuals, we take the derivative of the cost function above with respect to θ and set it to zero.

Normal Equation estimated coefficient θ vector (Image by Author)
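Setting that derivative to zero and solving for θ gives the familiar closed-form result (assuming XᵀX is invertible):

∂J/∂θ = 2XᵀXθ − 2Xᵀy = 0  ⟹  θ = (XᵀX)⁻¹ Xᵀy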

That's all the Normal Equation offers: a very direct and elegant approach to find the exact θ parameters that would do wonders to fit your data! However, there is a catch. The matrix inverse does not go well with larger datasets (a large X matrix) or with datasets where the inverse may not exist (the matrix is non-invertible, or singular). Even the most efficient practical inversion algorithms run in roughly cubic time. The Normal Equation solution can be preferred for "smaller" datasets, where computing a "costly" matrix inverse is not a concern, but for real-world datasets, approaches such as Gradient Descent or SGD are preferred.
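As a minimal NumPy sketch (the toy data and variable names here are my own, not from the original post), the closed-form solution and a numerically safer least-squares solver look like this:

import numpy as np

# Toy data: y = 4 + 3x + noise
rng = np.random.default_rng(42)
x = 2 * rng.random((100, 1))
y = 4 + 3 * x + rng.normal(size=(100, 1))

# Prefix a column of ones so theta_0 acts as the intercept
X_b = np.c_[np.ones((100, 1)), x]

# Closed-form Normal Equation: theta = (X^T X)^{-1} X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Numerically safer alternative based on least squares
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_normal.ravel())  # roughly [4, 3]
print(theta_lstsq.ravel())

np.linalg.lstsq (or np.linalg.pinv) avoids explicitly inverting XᵀX, which is exactly the step that becomes expensive or unstable on larger or singular matrices.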

Gradient Descent

The goal here is similar to what we had for the Normal Equation. The gradient descent algorithm finds the parameters that minimize the cost function by iteratively tweaking theta 0 and theta 1 until it reaches convergence. It is considered one of the best iterative optimization algorithms for minimizing residual errors.

Image by Author

Intuition

Think of a bowl, or any convex-shaped object, as your cost function. If you drop a ball from the side of the bowl, it will roll down the slope and settle at the bottom. That bottom point gives us the lowest cost and the θ parameters for the best-fitted model.

A few challenges can arise if there are many local minima instead of one global minimum; cost functions are not usually a regular bowl, so the algorithm might stop too early or take an indefinite amount of time to converge. But fortunately, the cost function in Linear Regression is convex: the line segment joining any two points on the curve never crosses below the curve, so there is a single global minimum.

It's a very straightforward process. GD starts by initializing θ with some random values. We calculate the cost by plugging those values into the cost function. The derivative of the cost function gives us the slope at that point, so we know in which direction to head next to lower the cost further. How does that happen? We use that derivative value to update all the θ coefficients at each step, until we reach the lowest point and the cost no longer decreases. SIMPLE, isn't it?

Partial derivatives of the Cost function (Image by Author)
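With the 1/2m-scaled cost from earlier, each partial derivative works out to:

∂J(θ)/∂θ_j = (1/m) · Σ_{i=1..m} (h_θ(x_i) − y_i) · x_i,j

where x_i,j is the j-th feature of the i-th sample.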

To compute all of them in one go, we can use the gradient vector, which contains the partial derivatives for each model parameter.

Gradient Vector of the Cost function (Image by Author)
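In matrix form, with the same scaling convention, the gradient vector is:

∇_θ J(θ) = (1/m) · Xᵀ(Xθ − y)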

Note that this gradient vector points uphill, so we need to move in the opposite direction to go downhill. Once we have the gradient vector, we multiply it by the learning rate η to determine the size of the downhill step.

Weight Update after each step (Image by Author)
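The update applied at every step is simply:

θ := θ − η · ∇_θ J(θ)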

There are a few things to keep in mind while trying different values of eta (η). If the learning rate is too low, it might take an indefinite amount of time to converge to the optimal value. Keeping it too high, however, might make the steps jump over to the other side of the valley and oscillate indefinitely. So η should be kept somewhere in between. To find a good learning rate, we can apply grid search: we try various η values while keeping a restriction on the number of iterations, and the one that converges fastest is a good learning rate for the problem. This method is called Batch Gradient Descent since it performs its calculations over the full training set X; a minimal sketch follows below.
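Here is a minimal, illustrative NumPy sketch of Batch Gradient Descent (the function name, default η, and iteration count are my own choices, not from the original post):

import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iterations=1000):
    # X: (m, n) feature matrix, y: (m, 1) targets
    m = len(y)
    X_b = np.c_[np.ones((m, 1)), X]           # prefix 1s for the intercept term
    theta = np.random.randn(X_b.shape[1], 1)  # random initialization
    for _ in range(n_iterations):
        gradients = (1 / m) * X_b.T @ (X_b @ theta - y)  # gradient of the MSE cost
        theta = theta - eta * gradients                  # step downhill
    return theta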

There are two more variants of Gradient Descent, which differ only slightly –

Stochastic Gradient Descent

Batch Gradient Descent would blow up if we have very large datasets; it can take forever to complete. Stochastic GD works in a similar fashion to Batch GD, but at each step it picks a random instance from the training set and computes the gradient based only on that instance. All the other calculations remain the same. It is faster than Batch GD since there is far less data to manipulate at a time.

Yet, this algorithm is much less regular: because it picks instances randomly, the cost function bounces up and down repeatedly. Even once it gets close to the optimum, it will continue to bounce around. Also, it often picks some instances more than once per epoch, so it is suggested to keep shuffling the data. TIP: it is better to start with a larger learning rate and then reduce it slowly, as in the sketch below.
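A minimal, illustrative sketch of SGD with a simple decaying learning-rate schedule (the schedule constants t0 and t1 are arbitrary choices of mine):

import numpy as np

def stochastic_gradient_descent(X, y, n_epochs=50, t0=5, t1=50):
    m = len(y)
    X_b = np.c_[np.ones((m, 1)), X]
    theta = np.random.randn(X_b.shape[1], 1)
    for epoch in range(n_epochs):
        for i in range(m):
            idx = np.random.randint(m)            # pick one random instance
            xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
            gradients = xi.T @ (xi @ theta - yi)  # gradient on that single instance
            eta = t0 / (epoch * m + i + t1)       # learning rate decays over time
            theta = theta - eta * gradients
    return theta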

Mini Batch Gradient Descent

It takes small random sets (mini-batches) of instances at each step. Sometimes it may be harder for it to escape from local minima, but its path is less irregular than that of Stochastic Gradient Descent.

SUMMARY

All three end up near the minimum, but Batch Gradient Descent actually stops at the minimum.

Batch Gradient Descent will take longer with larger datasets. Stochastic and Mini-batch GD can reach the vicinity of the minimum faster if we know how to adjust the learning rate.

Comparison between GD, SGD and Mini-batch GD (Source: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn…)

Linear Regression Model Example – COVID

Let us first walk through Chicago’s COVID-19 dataset. You can download the dataset from here.

Dataset (Image by Author)

80% of the COVID datasets available on the web are in a time-series format displaying daily case counts. With just those raw daily counts, I could only visualize whether the peak had been reached, whether cases were still increasing, and so on. But to forecast, I wanted features on which the case count could have some dependency. We also know that many countries have done inadequate testing, and thus report fewer cases. So I focused on datasets where I could get both test and case counts, to figure out whether there is a strong correlation between the two.

Fortunately, I found this small Chicago dataset with both test and case counts as features. You can observe how the data displays a roughly linear pattern on the scatter plot, showing the correlation.

Scatter plot diagram (Image by Author)

As you can see, the independent variable here is Tests (x) and the dependent variable is Cases (y). We have a few more features in the dataset, giving the counts of people who are older or younger than 30, those who are Latin, etc. I personally couldn't deduce what we could analyze with such features. You can try giving it a shot and let me know if we can infer something.

Let’s start coding 🙂

  • Import all the required libraries
  • pandas & NumPy for performing scientific operations
  • Linear Regression & Polynomial Features for building the models
  • train_test_split to divide the dataset into train and test subsets
  • MSE & r2_score metrics to analyze the model performance
  • seaborn and pyplot to visualize the graphs
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
  • Import the CSV dataset using pandas. It returns a DataFrame, a 2-D labeled data structure. data.head() displays the top 5 rows of the dataset to give us an idea of whether the object has the right type of data in it.
data = pd.read_csv("COVID-19_Daily_Testing.csv")
data.head()
  • To get a concise summary of the dataset, the info() method prints information about the DataFrame, such as the index and column data types, non-null counts, and memory usage.
print(data.info())
  • Time for some data cleaning! The Cases and Tests counts are stored as strings, because the thousands values contain commas (e.g. 1,468). After stripping the commas, pd.to_numeric() converts the type from string to a numeric dtype.
data['Cases'] = data['Cases'].str.replace(',', '')
data['Tests'] = data['Tests'].str.replace(',', '')
data['Cases'] = pd.to_numeric(data['Cases'])     
data['Tests'] = pd.to_numeric(data['Tests'])
  • Often, real-world datasets have a lot of features, and it is difficult to check which pairs of features show a good correlation, or whether there is a multicollinearity problem involved. The code below plots the pairwise relationships in the dataset, so that each variable is shared across both the x and y axes.
data_numeric = data.select_dtypes(include=['float64', 'int64'])
plt.figure(figsize=(20, 10))
sns.pairplot(data_numeric)
plt.show()
Feature pair plot (Image by Author)

Almost all of them show a linear pattern, which should not happen! This gives rise to the famous multicollinearity problem.

Multicollinearity occurs when the independent variables in a regression model are correlated with each other, even though independent variables should be independent. In layman's terms, you can think of several variables essentially measuring the same thing; there is no need for more than one variable to measure the same thing in a model. Multicollinearity reduces the precision of the estimated coefficients θ, undermining the statistical power of your regression model.

To diagnose multicollinearity, we use a measure called the variance inflation factor (VIF). The VIF measures how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related. The rule of thumb is that if VIF > 10, the model has high multicollinearity.

You can check in the above graph that we have high multicollinearity in our dataset. The good thing is that I have taken only 2 features in my model, but in the future, if you create any regression model, check the VIFs and drop the independent variables that are correlated with each other (a small sketch for computing VIFs follows below).
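If you want to compute VIFs yourself, here is a minimal sketch using statsmodels (the helper name and the example call are my own, assuming a DataFrame of numeric predictor columns):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(predictors: pd.DataFrame) -> pd.DataFrame:
    # Add a constant column so each VIF is computed against a model with an intercept
    X = np.column_stack([np.ones(len(predictors)), predictors.values])
    vifs = [variance_inflation_factor(X, i + 1) for i in range(predictors.shape[1])]
    return pd.DataFrame({"feature": predictors.columns, "VIF": vifs})

# Example (hypothetical): compute_vif(data_numeric.drop(columns=["Cases"]))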

  • Preparing the dataset. The feature values might be on different scales, so it is always preferred to bring them onto the same scale for better analysis and prediction. Here, we also reshape Tests (X) and Cases (y) into the 2-D column format that scikit-learn expects.
X = data['Tests'].values.reshape(-1,1)
y = data['Cases'].values.reshape(-1,1)
  • Applying Linear Regression
reg = LinearRegression()
reg.fit(X, y)
predictions = reg.predict(X)
print("The linear model is: Y = {:.5} + {:.5}X".format(reg.intercept_[0], reg.coef_[0][0]))
plt.figure(figsize=(16, 8))
plt.scatter(
    X,
    y,
    c='black'
)
plt.plot(
    X,
    predictions,
    c='blue',
    linewidth=2
)
plt.xlabel("Tests")
plt.ylabel("Cases")
plt.show()

We get this linear line Y = 97.777 + 0.18572X

Linear Regression Plot (Image by Author)

The Root Mean Square Error for Linear Regression => 171.8
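For reference, the RMSE above can be computed from the fitted model like this (a small sketch reusing y and predictions from the snippet above):

import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y, predictions))
print("The Root Mean Square Error for Linear Regression =>", round(rmse, 1))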

As you can see, we have many outliers, because of which the line is not fitting the data well. In cases where the data is complex like this, we can try to reduce the RMSE by applying Polynomial Regression.

Polynomial Regression

This is a special case of Linear Regression where we try to fit an n-th degree polynomial equation to the data. It outperforms LR in its capability to find relationships between the features. Here, we raise each feature to higher powers and append these new terms to our existing features. The training is then done on this extended feature set to obtain a curvilinear relationship between the target variable and the independent variables.

poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)   # adds the powers of X (up to degree 4) as new columns

lin2 = LinearRegression()
lin2.fit(X_poly, y)
pred = lin2.predict(X_poly)

# Sort by X so the fitted curve is drawn left to right
order = X[:, 0].argsort()

plt.figure(figsize=(16, 8))
plt.scatter(
    X,
    y,
    c='black'
)
plt.plot(
    X[order],
    pred[order],
    c='blue'
)
plt.xlabel("Tests")
plt.ylabel("Cases")
plt.show()

The scikit-learn PolynomialFeatures class takes a degree parameter, the maximum power applied to the data. You need to try and test which degree minimizes the RMSE the most. After trying different values, I found that degree 4 gives me the best solution.

Polynomial Regression plot (Image by Author)

The Root Mean Square Error for Polynomial Regression => 131.08. Observe how this plot fits the data much better than Linear Regression. The RMSE also decreased to a great extent, which is further proof of its goodness. There are a lot more concepts needed to judge the performance of a model, such as whether it is overfitting or underfitting and the bias/variance trade-off. We will surely cover that in the next blog!

Evaluation Metrics

An RMSE of 0 means that your model is a perfect predictor of the outputs (but this will almost never happen).

There are various kinds of evaluation metrics for regression & classification problems. RMSE is one of them; it should be used when large errors are particularly undesirable, since squaring the errors penalizes them heavily.

Another metric we have is R-squared (R²), which gives a relative measure of how well the regression line fits the data. Its value ranges between 0 and 1: the higher the value, the better the model. R² can be thought of as the percentage of variance explained by the model. It has an improved version called Adjusted R-squared that also takes into account features being dropped or added: R² either stays constant or increases whenever a feature is added, whereas Adjusted R-squared penalizes features that do not improve the model, which is why it is more reliable when you have a lot of features to experiment with.
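As a quick sketch of how these are computed (reusing y and predictions from the linear model above; the adjusted version uses the standard formula with n samples and k predictors, here k = 1):

from sklearn.metrics import r2_score

r2 = r2_score(y, predictions)
n, k = len(y), 1                                 # number of samples and predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # Adjusted R-squared
print("R2:", round(r2, 3), "Adjusted R2:", round(adj_r2, 3))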


And that’s it for now.

Hope you got a good understanding of finding a well-fitted line in a real-world scenario! Regression happens everywhere; it's time to start noticing it! If you have any questions/thoughts, feel free to leave your feedback in the comment section below, or you can reach me on LinkedIn. Till then, wait for the next post! 😀


