Logistic Regression

“It is more important to have beauty in one’s equations than to have them fit experiment…” — Paul Dirac

Logistic regression is a method for ‘classification’. It is used in many scenarios where we want to sort inputs into predefined classes, for example tagging an email as spam or not spam, or predicting the age group of a customer from the data of a commerce portal.
In linear regression the output is a continuous range, i.e. an infinite set, while in logistic regression the output y we want to predict takes only a small number of discrete values, i.e. a finite set. For simplicity, let’s consider binary classification, where y can take only two values: 1 (positive) and 0 (negative).
Just like linear regression, we need to start with a hypothesis. As the output is bounded between 0 and 1, it doesn’t make sense to have a hypothesis that produces values outside this range.
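A hypothesis that stays inside this range is usually built from the logistic (sigmoid) function, h(x) = 1 / (1 + e^(-theta·x)). The sketch below assumes that standard form; the names sigmoid and hypothesis are illustrative, not taken from the text.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)); the output always lies between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z) to avoid overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta . x), for theta and x given as equal-length lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


# Evaluate the curve at a few points in (-10, 10).
for z in (-10, -2, 0, 2, 10):
    print(z, sigmoid(z))
```

At z = 0 the function returns exactly 0.5, and it approaches 0 and 1 towards the two ends of the range, which is what makes it suitable as a probability.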

Plot of f(x) for x in (-10, 10)

Given the above set of logistic regression models (why a set? Because theta is variable), we need to find the coefficients theta of the model that best explains the training set. For that we start with a set of probabilistic assumptions parameterised by theta and then find theta via maximum likelihood.
Let’s start with the Bernoulli distribution: the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p.
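Written out, the step from the Bernoulli distribution to the likelihood looks roughly as follows. This is a sketch in common textbook notation, all of which is an assumption here: m is the number of training examples, (x^(i), y^(i)) is the i-th example, the examples are taken to be independent, and h_theta(x) plays the role of p.

```latex
\begin{align*}
P(y \mid x; \theta) &= h_\theta(x)^{y}\,\bigl(1 - h_\theta(x)\bigr)^{1 - y}, \qquad y \in \{0, 1\} \\
L(\theta) &= \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}\,\bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}} \\
\ell(\theta) &= \log L(\theta)
  = \sum_{i=1}^{m} \Bigl[\, y^{(i)} \log h_\theta(x^{(i)})
  + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]
\end{align*}
```

The first line is just the Bernoulli probability written compactly: it gives h when y = 1 and 1 - h when y = 0.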

In linear regression we find the coefficients by setting the derivative of the log likelihood to zero. Here we can evaluate the derivative in the same way, but the resulting Eq. (3) has no closed-form solution. (Remember that x and theta are both vectors in the equation and h is a non-linear function.)
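For reference, the derivative in question can be worked out as below. The notation is an assumption (the usual sigmoid hypothesis, with m training examples and index j over the coefficients), so it may not match the original numbering of the equations.

```latex
\begin{align*}
h_\theta(x) &= g(\theta^{T} x), \qquad
g(z) = \frac{1}{1 + e^{-z}}, \qquad
g'(z) = g(z)\,\bigl(1 - g(z)\bigr) \\
\frac{\partial \ell(\theta)}{\partial \theta_j}
  &= \sum_{i=1}^{m} \bigl( y^{(i)} - h_\theta(x^{(i)}) \bigr)\, x_j^{(i)}
\end{align*}
```

Setting this to zero does not let us isolate theta, because theta sits inside the exponential of every term in the sum.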
We can still find the coefficients with an iterative algorithm called gradient ascent: we start with some initial coefficients and keep updating theta in the direction of the gradient until the likelihood function converges.
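A minimal sketch of gradient ascent for the logistic log likelihood, using only the Python standard library. The step size alpha, the tolerance tol, the starting point of all zeros, and the toy data are illustrative assumptions, not values from the text.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(theta, X, y):
    """Sum over examples of y*log(h) + (1 - y)*log(1 - h)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h + eps) + (1 - yi) * math.log(1 - h + eps)
    return total


def gradient(theta, X, y):
    """Gradient of the log likelihood: sum of (y - h) * x over the examples."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += err * v
    return grad


def gradient_ascent(X, y, alpha=0.01, tol=1e-9, max_iter=100_000):
    """Step along the gradient until the log likelihood stops improving."""
    theta = [0.0] * len(X[0])
    prev = log_likelihood(theta, X, y)
    for _ in range(max_iter):
        grad = gradient(theta, X, y)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
        cur = log_likelihood(theta, X, y)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return theta


# Toy data (made up for illustration): first column is the constant 1 for the intercept.
X_toy = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6]]
y_toy = [0, 0, 1, 0, 1, 1]
theta_hat = gradient_ascent(X_toy, y_toy)
print(theta_hat)
```

The toy labels are deliberately not perfectly separable; with separable data the likelihood has no finite maximum and theta would keep growing. The step size also matters: too large a value of alpha can make the updates overshoot instead of converge.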

Let’s take the Wikipedia example.
Suppose we wish to answer the following question:

A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam?
Plot of the derived model for the range (0, 6)
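The model for this example can be fitted with a few lines of gradient ascent. The sketch below assumes the 20 data points as transcribed from the Wikipedia "Logistic regression" article; they are not listed in this text, so treat them as an assumption to check against the source. The step size and iteration count are also illustrative choices.

```python
import math

# Hours studied and pass (1) / fail (0) for 20 students.
# Assumption: values transcribed from the Wikipedia example, not from this text.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
          1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Gradient ascent on the log likelihood for the model
# P(pass) = sigmoid(intercept + slope * hours).
intercept, slope = 0.0, 0.0
alpha = 0.01  # step size (illustrative)
for _ in range(50_000):
    g_intercept = g_slope = 0.0
    for h, p in zip(hours, passed):
        err = p - sigmoid(intercept + slope * h)
        g_intercept += err
        g_slope += err * h
    intercept += alpha * g_intercept
    slope += alpha * g_slope


def pass_probability(h):
    """Estimated probability of passing after h hours of study."""
    return sigmoid(intercept + slope * h)


print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
for h in (1, 2, 3, 4, 5):
    print(f"{h} hours -> P(pass) = {pass_probability(h):.2f}")
```

With these data the fit should land near an intercept of about -4.08 and a slope of about 1.50, the values reported for this example. A positive slope means that each additional hour of study raises the estimated probability of passing.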