Cost function for logistic regression
- Fit θ parameters
- Define the optimization objective for the cost function we use to fit the parameters
- Training set of m training examples
- Each example is an (n+1)-dimensional column vector
- This is the situation
- Set of m training examples
- Each example is a feature vector which is n+1 dimensional
- x0 = 1
- y ∈ {0,1}
- Hypothesis is based on parameters (θ)
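The setup above can be sketched in code. A minimal example using NumPy; the names `sigmoid` and `hypothesis` are my own, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x); x is the (n+1)-vector with x0 = 1
    return sigmoid(theta @ x)

# With theta = 0, the hypothesis is 0.5 regardless of the features
print(hypothesis(np.array([0.0, 0.0]), np.array([1.0, 5.0])))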
- Given the training set, how do we choose/fit θ?
- Linear regression uses the following function to determine θ
- Instead of writing the squared error term, we can write
- If we define "cost()" as;
- cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾) = (1/2)(hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾)²
- Which evaluates to the cost for an individual example using the same measure as used in linear regression
- We can redefine J(θ) as
- Which, appropriately, is the sum of all the individual costs over the training data (i.e. the same as linear regression)
- To further simplify it we can get rid of the superscripts
- So: cost(hθ(x), y) = (1/2)(hθ(x) - y)²
- What does this actually mean?
- This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
- If we use this function for logistic regression, the resulting J(θ) is a non-convex function of the parameters
- Could work....
- What do we mean by non convex?
- We have some function - J(θ) - for determining the parameters
- Our hypothesis function has a non-linearity (sigmoid function of hθ(x) )
- This is a complicated non-linear function
- If you take hθ(x) and plug it into the Cost() function, then plug the Cost() function into J(θ) and plot J(θ), we find many local optima -> a non-convex function
- Why is this a problem?
- Lots of local minima mean gradient descent may not find the global optimum - it may get stuck in a local minimum
- We would like a convex function so if you run gradient descent you converge to a global minimum
A convex logistic regression cost function
- To get around this we need a different, convex Cost() function which means we can apply gradient descent
- This is our logistic regression cost function
- This is the penalty the algorithm pays
- Plot the function
- Plot y = 1
- So the cost evaluates as -log(hθ(x))
- So when we're right (hθ(x) = 1), the cost is 0
- Otherwise the cost increases as the prediction becomes "more" wrong
- X axis is what we predict
- Y axis is the cost associated with that prediction
- This cost function has some interesting properties
- If y = 1 and hθ(x) = 1
- If the hypothesis predicts exactly 1 and that's exactly correct, then the cost is 0 (exactly, not nearly 0)
- As hθ(x) goes to 0
- Cost goes to infinity
- This captures the intuition that if hθ(x) = 0 (predict P (y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
- What about if y = 0
- then cost is evaluated as -log(1- hθ( x ))
- This is the mirror image of the y = 1 case
- Now it goes to plus infinity as hθ(x) goes to 1
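The two branches of this cost can be written directly. A sketch (the function name `cost_single` is my own):

```python
import numpy as np

def cost_single(h, y):
    # Cost the algorithm pays for predicting h = h_theta(x) when the label is y
    if y == 1:
        return -np.log(h)        # 0 when h = 1, -> infinity as h -> 0
    else:
        return -np.log(1.0 - h)  # 0 when h = 0, -> infinity as h -> 1

print(cost_single(0.01, 1))  # predicted ~0 but y = 1: a massive cost
```

Note how `cost_single(1.0, 1)` and `cost_single(0.0, 0)` are exactly zero, matching the plots described above.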
- With this particular cost function, J(θ) is convex and avoids local minima
Simplified cost function and gradient descent
- Define a simpler way to write the cost function and apply gradient descent to logistic regression
- By the end, we should be able to implement a fully functional logistic regression classifier
- Logistic regression cost function is as follows
- This is the cost for a single example
- For binary classification problems y is always 0 or 1
- Because of this, we can have a simpler way to write the cost function
- Rather than writing cost function on two lines/two cases
- Can compress them into one equation - more efficient
- Can write cost function is
- cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
- This equation is a more compact form of the two cases above
- We know that there are only two possible cases
- y = 1
- Then our equation simplifies to
- -log(hθ(x)) - (0)log(1 - hθ(x))
- = -log(hθ(x))
- Which is what we had before when y = 1
- y = 0
- Then our equation simplifies to
- -(0)log(hθ(x)) - (1)log(1 - hθ(x))
- = -log(1- hθ(x))
- Which is what we had before when y = 0
- Clever!
- So, in summary, our cost function for the θ parameters can be defined as
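Summing the compressed per-example cost over the m training examples gives J(θ). A vectorized sketch with NumPy (the name `compute_cost` is assumed):

```python
import numpy as np

def compute_cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta for every example at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# With theta = 0 every h is 0.5, so the cost is log(2) whatever the labels are
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(compute_cost(np.zeros(2), X, y))  # ≈ 0.6931
```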
- Why do we choose this function when other cost functions exist?
- This cost function can be derived from statistics using the principle of maximum likelihood estimation
- Note: the maximum likelihood derivation models y as Bernoulli-distributed given x
- Also has the nice property that it's convex
- To fit parameters θ:
- Find parameters θ which minimize J(θ)
- This means we have a set of parameters to use in our model for future predictions
- Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using
- This result is
- p(y=1 | x ; θ)
- Probability y = 1, given x, parameterized by θ
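Making a prediction with the fitted θ then looks like this. A sketch where `predict` thresholds the probability at 0.5 (a conventional choice, not stated in the notes):

```python
import numpy as np

def predict_proba(theta, x):
    # p(y = 1 | x; theta) for a single example x (with x0 = 1)
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def predict(theta, x):
    # Classify as 1 when the estimated probability is at least 0.5
    return 1 if predict_proba(theta, x) >= 0.5 else 0
```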
How to minimize the logistic regression cost function
- Now we need to figure out how to minimize J(θ)
- Use gradient descent as before
- Repeatedly update each parameter using a learning rate
- If you have n features, θ is an (n+1)-dimensional column vector
- This equation is the same as the linear regression rule
- The only difference is that our definition for the hypothesis has changed
- Previously, we spoke about how to monitor gradient descent to check it's working
- Can do the same thing here for logistic regression
- When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously
- Could use a for loop
- Better would be a vectorized implementation
- Feature scaling for gradient descent for logistic regression also applies here
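The points above can be combined into a vectorized update, θ := θ - (α/m) Xᵀ(h - y), which updates all θ values simultaneously without a for loop over parameters. A sketch with an assumed learning rate and iteration count:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iters=2000):
    # Simultaneous update of all theta_j: theta := theta - (alpha/m) * X^T (h - y)
    m, n = X.shape           # n counts the columns, i.e. n features plus x0 = 1
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # hypothesis for every example
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta

# Toy 1-feature data: bias column x0 = 1, labels switch around x = 1.5
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
```

On this toy data the learned θ separates the two classes; with features on very different scales, feature scaling (as noted above) would be applied to X first.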
<source:http://www.holehouse.org/mlclass/06_Logistic_Regression.html, Andrew Ng's Coursera Lectures>