- Where y is a discrete value
- Develop the logistic regression algorithm to determine what class a new input should fall into
- Classification problems
- Email -> spam/not spam?
- Online transactions -> fraudulent?
- Tumor -> Malignant/benign
- The variable we want to predict in these problems is y
- Y is either 0 or 1
- 0 = negative class (absence of something)
- 1 = positive class (presence of something)
- Start with binary class problems
- Later look at multiclass classification problem, although this is just an extension of binary classification
- How do we develop a classification algorithm?
- Tumour size vs malignancy (0 or 1)
- We could use linear regression
- Then threshold the classifier output (i.e. anything over some value is yes, else no)
- In our tumour size vs malignancy example, linear regression with thresholding seems to work
- It does a reasonable job of stratifying the data points into one of two classes
- But what if we had a single Yes with a very large tumour size (an outlier far to the right)?
- Refitting flattens the regression line and shifts the 0.5 threshold, so some of the existing yeses get classified as nos (see the numeric sketch at the end of this section)
- Another issue with linear regression
- We know Y is 0 or 1
- The hypothesis can give values larger than 1 or less than 0
- Logistic regression, in contrast, generates a value that is always between 0 and 1
- Logistic regression is a classification algorithm - don't be confused by the name
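- A minimal numeric sketch of the outlier problem described above, on made-up tumour data; NumPy's polyfit stands in for linear regression here:

```python
import numpy as np

# Made-up tumour sizes and labels (0 = benign, 1 = malignant)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

def threshold_classifier(x_train, y_train):
    # Fit a least-squares line, then predict y = 1 wherever it crosses 0.5
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return lambda x_new: (slope * x_new + intercept >= 0.5).astype(int)

clf = threshold_classifier(x, y)
print(clf(x))   # [0 0 0 1 1 1] -- fine on this data

# One malignant outlier with an extreme tumour size flattens the line,
# shifting the 0.5 crossing so a previously correct yes becomes a no
clf2 = threshold_classifier(np.append(x, 20.0), np.append(y, 1))
print(clf2(x))  # [0 0 0 0 1 1] -- the example at x = 4 has flipped
```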
Hypothesis representation
- What function is used to represent our hypothesis in classification?
- We want our classifier to output values between 0 and 1
- When using linear regression we did hθ(x) = θ^T x
- For classification hypothesis representation we do hθ(x) = g(θ^T x)
- Where we define g(z)
- z is a real number
- g(z) = 1/(1 + e^(-z))
- This is the sigmoid function, or the logistic function
- If we combine these equations we can write out the hypothesis as
- hθ(x) = 1/(1 + e^(-θ^T x))
- What does the sigmoid function look like?
- Crosses 0.5 at z = 0, then flattens out
- Asymptotes at 0 and 1
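- A quick sketch of the sigmoid in Python (NumPy), checking the properties above:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                        # 0.5 -- crosses 0.5 at z = 0
print(sigmoid(np.array([-10.0, 10.0])))  # ~[0, 1] -- the asymptotes
```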
- Given this we need to fit θ to our data
- When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
- Example
- If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
- hθ(x) = 0.7
- Tells a patient they have a 70% chance of a tumor being malignant
- We can write this using the following notation
- hθ(x) = P(y=1|x ; θ)
- What does this mean?
- Probability that y=1, given x, parameterized by θ
- Since this is a binary classification task we know y = 0 or 1
- So the following must be true
- P(y=1|x ; θ) + P(y=0|x ; θ) = 1
- P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
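- A sketch of this probability reading; the θ values below are made up purely for illustration, not fitted to anything:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters only -- in practice these come from fitting
theta = np.array([-6.0, 0.098])
x = np.array([1.0, 70.0])  # x0 = 1 (as always), x1 = tumourSize

p_malignant = sigmoid(theta @ x)  # h_theta(x) = P(y = 1 | x; theta)
print(p_malignant)                # ~0.70 -- the 70% example above
print(1 - p_malignant)            # P(y = 0 | x; theta) = 1 - P(y = 1 | x; theta)
```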
Decision boundary
- Gives a better sense of what the hypothesis function is computing, and what it looks like
- One way of using the sigmoid function is:
- When the probability of y being 1 is greater than 0.5 then we can predict y = 1
- Else we predict y = 0
- When is it exactly that hθ(x) is greater than 0.5?
- Look at sigmoid function
- g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
- So if z is positive, g(z) is greater than 0.5
- z = θ^T x
- So when
- θ^T x >= 0
- Then hθ(x) >= 0.5
- So what we've shown is that the hypothesis predicts y = 1 when θ^T x >= 0
- The corollary is that when θ^T x < 0 the hypothesis predicts y = 0
- Let's use this to better understand how the hypothesis makes its predictions
- hθ(x) = g(θ0 + θ1x1 + θ2x2)
- So, for example
- θ0 = -3
- θ1 = 1
- θ2 = 1
- So our parameter vector is a column vector with the above values
- So, θ^T is a row vector = [-3,1,1]
- What does this mean?
- The z here becomes θ^T x
- We predict "y = 1" if
- -3x0 + 1x1 + 1x2 >= 0
- -3 + x1 + x2 >= 0
- We can also re-write this as
- If (x1 + x2 >= 3) then we predict y = 1
- If we plot x1 + x2 = 3 we graphically plot our decision boundary
- This gives us two regions on the graph: blue = false (predict y = 0), magenta = true (predict y = 1)
- The line between them is the decision boundary
- Concretely, the straight line is the set of points where hθ(x) = 0.5 exactly
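- The same example in code: with θ^T = [-3, 1, 1], points are classified by which side of x1 + x2 = 3 they fall on:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    x = np.array([1.0, x1, x2])            # x0 = 1
    return int(sigmoid(theta @ x) >= 0.5)  # y = 1 iff theta^T x >= 0

print(predict(1, 1))  # 0 -- x1 + x2 < 3, below the boundary
print(predict(3, 3))  # 1 -- x1 + x2 > 3, above the boundary
print(sigmoid(theta @ np.array([1.0, 1.0, 2.0])))  # 0.5 -- exactly on the line
```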
- The decision boundary is a property of the hypothesis
- Means we can create the boundary with the hypothesis and parameters without any data
- Later, we use the data to determine the parameter values
- e.g. if the parameters were θ^T = [5, -1, 0], we'd predict y = 1 if
- 5 - x1 > 0
- i.e. x1 < 5
- We can also get logistic regression to fit a complex non-linear data set
- Like polynomial regression, we do this by adding higher order terms
- So say we have
- hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2)
- We take the transpose of the θ vector times the input vector
- Say θ^T was [-1, 0, 0, 1, 1]; then we predict "y = 1" if
- -1 + x1^2 + x2^2 >= 0
- or x1^2 + x2^2 >= 1
- If we plot x1^2 + x2^2 = 1 we get the decision boundary
- This gives us a circle with a radius of 1 centred on the origin
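- A sketch of this circular boundary, using the feature vector [1, x1, x2, x1^2, x2^2]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return int(sigmoid(theta @ features) >= 0.5)

print(predict(0.5, 0.5))  # 0 -- inside the unit circle (0.25 + 0.25 < 1)
print(predict(1.0, 1.0))  # 1 -- outside (1 + 1 >= 1)
```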
- Means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
- More complex decision boundaries?
- By using higher order polynomial terms, we can get even more complex decision boundaries
<source: http://www.holehouse.org, Coursera Machine learning class of Andrew Ng>