Cost function for logistic regression
- Fit θ parameters
- Define the optimization objective for the cost function we use to fit the parameters
- Training set of m training examples
- Each example is an (n+1)-dimensional column vector
- This is the situation
- Set of m training examples
- Each example is a feature vector which is n+1 dimensional
- x0 = 1
- y ∈ {0,1}
- Hypothesis is based on parameters (θ)
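The setup above can be sketched in code. A minimal example using NumPy; the names `sigmoid` and `hypothesis` are my own, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x); x is the (n+1)-vector with x0 = 1
    return sigmoid(theta @ x)

# With theta = 0, the hypothesis is 0.5 regardless of the features
print(hypothesis(np.array([0.0, 0.0]), np.array([1.0, 5.0])))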
- Given the training set, how do we choose/fit θ?
- Linear regression uses the following function to determine θ
- Instead of writing the squared error term, we can write
- If we define "cost()" as;
- cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾) = (1/2)(hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾)²
- Which evaluates to the cost for an individual example using the same measure as used in linear regression
- We can redefine J(θ) as
- Which, appropriately, is the sum of all the individual costs over the training data (i.e. the same as linear regression)
- To further simplify it we can get rid of the superscripts
- So: cost(hθ(x), y) = (1/2)(hθ(x) - y)²
- What does this actually mean?
- This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
- If we use this function for logistic regression, the resulting J(θ) is a non-convex function of the parameters
- Could work....
- What do we mean by non convex?
- We have some function - J(θ) - for determining the parameters
- Our hypothesis function has a non-linearity (sigmoid function of hθ(x) )
- This is a complicated non-linear function
- If you take hθ(x) and plug it into the Cost() function, then plug the Cost() function into J(θ) and plot J(θ), we find many local optima -> a non-convex function
- Why is this a problem?
- Lots of local minima mean gradient descent may not find the global optimum - it may get stuck in a local minimum
- We would like a convex function so if you run gradient descent you converge to a global minimum
A convex logistic regression cost function
- To get around this we need a different, convex Cost() function which means we can apply gradient descent
- This is our logistic regression cost function
- This is the penalty the algorithm pays
- Plot the function
- Plot y = 1
- So the cost evaluates as -log(hθ(x))
- So when we're right (hθ(x) = 1), the cost is 0
- Otherwise the cost increases as the prediction becomes "more" wrong
- X axis is what we predict
- Y axis is the cost associated with that prediction
- This cost function has some interesting properties
- If y = 1 and hθ(x) = 1
- If the hypothesis predicts exactly 1 and that's exactly correct, then the cost is 0 (exactly, not nearly 0)
- As hθ(x) goes to 0
- Cost goes to infinity
- This captures the intuition that if hθ(x) = 0 (predict P (y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
- What about if y = 0
- then cost is evaluated as -log(1- hθ( x ))
- This is the mirror image of the y = 1 case
- Now it goes to plus infinity as hθ(x) goes to 1
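The two branches of this cost can be written directly. A sketch (the function name `cost_single` is my own):

```python
import numpy as np

def cost_single(h, y):
    # Cost the algorithm pays for predicting h = h_theta(x) when the label is y
    if y == 1:
        return -np.log(h)        # 0 when h = 1, -> infinity as h -> 0
    else:
        return -np.log(1.0 - h)  # 0 when h = 0, -> infinity as h -> 1

print(cost_single(0.01, 1))  # predicted ~0 but y = 1: a massive cost
```

Note how `cost_single(1.0, 1)` and `cost_single(0.0, 0)` are exactly zero, matching the plots described above.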
- With this particular cost function, J(θ) is convex and avoids local minima
Simplified cost function and gradient descent
- Define a simpler way to write the cost function and apply gradient descent to logistic regression
- By the end, we should be able to implement a fully functional logistic regression classifier
- Logistic regression cost function is as follows
- This is the cost for a single example
- For binary classification problems y is always 0 or 1
- Because of this, we can have a simpler way to write the cost function
- Rather than writing cost function on two lines/two cases
- Can compress them into one equation - more efficient
- Can write cost function is
- cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
- This equation is a more compact form of the two cases above
- We know that there are only two possible cases
- y = 1
- Then our equation simplifies to
- -log(hθ(x)) - (0)log(1 - hθ(x))
- = -log(hθ(x))
- Which is what we had before when y = 1
- y = 0
- Then our equation simplifies to
- -(0)log(hθ(x)) - (1)log(1 - hθ(x))
- = -log(1- hθ(x))
- Which is what we had before when y = 0
- Clever!
- So, in summary, our cost function for the θ parameters can be defined as
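Summing the compressed per-example cost over the m training examples gives J(θ). A vectorized sketch with NumPy (the name `compute_cost` is assumed):

```python
import numpy as np

def compute_cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta for every example at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# With theta = 0 every h is 0.5, so the cost is log(2) whatever the labels are
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(compute_cost(np.zeros(2), X, y))  # ≈ 0.6931
```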
- Why do we choose this function when other cost functions exist?
- This cost function can be derived from statistics using the principle of maximum likelihood estimation
- Note: the maximum likelihood derivation models y as Bernoulli-distributed given x
- Also has the nice property that it's convex
- To fit parameters θ:
- Find parameters θ which minimize J(θ)
- This means we have a set of parameters to use in our model for future predictions
- Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using
- This result is
- p(y=1 | x ; θ)
- Probability y = 1, given x, parameterized by θ
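Making a prediction with the fitted θ then looks like this. A sketch where `predict` thresholds the probability at 0.5 (a conventional choice, not stated in the notes):

```python
import numpy as np

def predict_proba(theta, x):
    # p(y = 1 | x; theta) for a single example x (with x0 = 1)
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def predict(theta, x):
    # Classify as 1 when the estimated probability is at least 0.5
    return 1 if predict_proba(theta, x) >= 0.5 else 0
```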
How to minimize the logistic regression cost function
- Now we need to figure out how to minimize J(θ)
- Use gradient descent as before
- Repeatedly update each parameter using a learning rate
- If you have n features, θ is an (n+1)-dimensional column vector
- This equation is the same as the linear regression rule
- The only difference is that our definition for the hypothesis has changed
- Previously, we spoke about how to monitor gradient descent to check it's working
- Can do the same thing here for logistic regression
- When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously
- Could use a for loop
- Better would be a vectorized implementation
- Feature scaling for gradient descent for logistic regression also applies here
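The points above can be combined into a vectorized update, θ := θ - (α/m) Xᵀ(h - y), which updates all θ values simultaneously without a for loop over parameters. A sketch with an assumed learning rate and iteration count:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iters=2000):
    # Simultaneous update of all theta_j: theta := theta - (alpha/m) * X^T (h - y)
    m, n = X.shape           # n counts the columns, i.e. n features plus x0 = 1
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # hypothesis for every example
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta

# Toy 1-feature data: bias column x0 = 1, labels switch around x = 1.5
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
```

On this toy data the learned θ separates the two classes; with features on very different scales, feature scaling (as noted above) would be applied to X first.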
<source:http://www.holehouse.org/mlclass/06_Logistic_Regression.html, Andrew Ng's Coursera Lectures>