# Logistic Regression

“It is more important to have beauty in one’s equations than to have them fit experiment…” — Paul Dirac

Despite its name, logistic regression is a **classification** technique. It is used in many scenarios where we want to categorize inputs into predefined classes, for example tagging an email as spam or not spam, or predicting the age group of a customer from the data of a commerce portal.

In *linear regression* the output domain is a continuous range, i.e. it’s an infinite set, while in logistic regression the output *y* we want to predict takes only a small number of discrete values, i.e. it’s a finite set. For simplicity, let’s consider binary classification, where y can take only two values, 1 (positive) and 0 (negative).

Just like linear regression, we need to start with a hypothesis. As the output takes values only in {0, 1}, it doesn’t make sense to have a hypothesis which produces values outside the range [0, 1].
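A common choice that satisfies this constraint is the sigmoid (logistic) function applied to a linear combination of the inputs. The form below is the standard textbook hypothesis and is assumed here, since the original equation is not shown in the text:

$$
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}, \qquad 0 < h_\theta(x) < 1
$$

Here $h_\theta(x)$ can be read as the estimated probability that $y = 1$ for input $x$.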

Given the above set of logistic regression models (why a set? because theta is variable), we need to find the coefficients theta of the model which best explains the training set. For that, we start with a set of probabilistic assumptions parameterised by theta and then find theta via **maximum likelihood**.

Let’s start with the **Bernoulli distribution**, the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p.
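As a sketch of how this leads to a likelihood, assume the hypothesis $h_\theta(x)$ plays the role of $p$, i.e. it is the probability that $y = 1$ given $x$, and that the $m$ training examples $(x_i, y_i)$ are independent. These are the usual textbook assumptions rather than something stated explicitly above:

$$
P(y \mid x; \theta) = h_\theta(x)^{\,y}\,\bigl(1 - h_\theta(x)\bigr)^{1 - y}
$$

$$
\ell(\theta) = \log \prod_{i=1}^{m} P(y_i \mid x_i; \theta)
= \sum_{i=1}^{m} \Bigl[\, y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \Bigr]
$$

Maximum likelihood then means choosing the theta that makes $\ell(\theta)$ as large as possible.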

In *linear regression* we find the coefficients by setting the derivative of the log likelihood to zero. Here we evaluate the derivative of the likelihood in the same way, but the resulting Eq. (3) has no closed-form solution for theta. (Remember that x and theta are both vectors in the equation and h is a non-linear function.)
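For reference, under the assumption that h is the sigmoid function and that $\ell(\theta)$ denotes the Bernoulli log likelihood over $m$ training examples, the derivative takes the following standard form (the numbering and notation of the original equation may differ):

$$
\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \bigl(y_i - h_\theta(x_i)\bigr)\, x_{ij}
$$

Setting this to zero gives a system of equations in which theta appears inside the non-linear $h_\theta$, which is why it cannot be rearranged into an explicit formula for theta.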

We can still find the coefficients by using an iterative algorithm called *Gradient Ascent*, where we start with some initial coefficients and then keep updating theta in the direction of the gradient until the likelihood function converges.
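The procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a sigmoid hypothesis and the update rule theta := theta + lr * gradient, and the function names, the learning rate `lr`, the tolerance `tol`, and the small toy dataset are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Bernoulli log likelihood of labels y given features X and theta."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient_ascent(X, y, lr=0.1, tol=1e-8, max_iter=10000):
    """Maximise the log likelihood by repeatedly stepping along its gradient."""
    theta = np.zeros(X.shape[1])          # start with all coefficients at zero
    prev = log_likelihood(theta, X, y)
    for _ in range(max_iter):
        grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log likelihood
        theta = theta + lr * grad / len(y)      # ascent step (averaged gradient)
        cur = log_likelihood(theta, X, y)
        if abs(cur - prev) < tol:               # stop once the likelihood settles
            break
        prev = cur
    return theta

# Hypothetical toy data: one feature, labels deliberately not perfectly separable.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])  # first column is the intercept term

theta = gradient_ascent(X, y)
print("theta:", theta)
print("log likelihood:", log_likelihood(theta, X, y))
```

The gradient is divided by the number of examples so that a single learning rate behaves similarly across dataset sizes. Note that if the classes were perfectly separable, the likelihood would keep increasing as theta grows, so the loop would stop only because of `tol` or `max_iter`.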

**Example**

Let’s take the Wikipedia example.

Suppose we wish to answer the following question:

A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam?