Tuesday, March 22, 2016

Logistic Regression (1)

Classification
  • Where y is a discrete value
    • Develop the logistic regression algorithm to determine what class a new input should fall into
  • Classification problems
    • Email -> spam/not spam?
    • Online transactions -> fraudulent?
    • Tumor -> Malignant/benign?
  • The variable to predict in these problems is y
    • y is either 0 or 1
      • 0 = negative class (absence of something)
      • 1 = positive class (presence of something)
  • Start with binary class problems
    • Later look at multiclass classification problem, although this is just an extension of binary classification
  • How do we develop a classification algorithm?
    • Tumour size vs malignancy (0 or 1)
    • We could use linear regression
      • Then threshold the classifier output (i.e. anything over some value is yes, else no)
      • In the lecture's example (figure not reproduced here) linear regression with thresholding seems to work
  • That does a reasonable job of stratifying the data points into one of two classes
    • But what if we then saw one extra Yes example with a very large tumour, far to the right?
    • Refitting would flatten the line and shift the 0.5 crossing point, so some of the existing yeses would be classified as nos (see the numeric sketch after this list)
  • Another issue with linear regression
    • We know y is 0 or 1
    • But the hypothesis can give values larger than 1 or less than 0
  • So instead we use logistic regression, which generates a value that is always between 0 and 1
    • Logistic regression is a classification algorithm, despite the name - don't be confused
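To make the failure mode concrete, here is a minimal numeric sketch of fit-then-threshold linear regression. The tumour sizes, labels, and the fit_and_threshold helper are invented for illustration; they are not the lecture's data:

import numpy as np

# Hypothetical tumour sizes and labels (0 = benign, 1 = malignant)
sizes = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 1, 1, 1])

def fit_and_threshold(x, y, cutoff=0.5):
    # Least-squares line y = a*x + b, then predict 1 where it crosses the cutoff
    a, b = np.polyfit(x, y, 1)
    return (a * x + b >= cutoff).astype(int)

print(fit_and_threshold(sizes, labels))    # [0 0 0 1 1 1] -- matches the labels

# One extra malignant example far to the right flattens the refit line,
# moves the 0.5 crossing point right, and turns the yes at size 6 into a no
sizes2 = np.append(sizes, 50.0)
labels2 = np.append(labels, 1)
print(fit_and_threshold(sizes2, labels2))  # [0 0 0 0 1 1 1] -- size 6 misclassified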
Hypothesis representation
  • What function is used to represent our hypothesis in classification?
  • We want our classifier to output values between 0 and 1
    • When using linear regression we did hθ(x) = θ^T x
    • For classification hypothesis representation we do hθ(x) = g(θ^T x)
      • Where we define g(z)
        • z is a real number
      • g(z) = 1 / (1 + e^-z)
        • This is the sigmoid function, or the logistic function
      • If we combine these equations we can write out the hypothesis as
        • hθ(x) = 1 / (1 + e^-(θ^T x))
  • What does the sigmoid function look like?
  • Crosses 0.5 at z = 0, then flattens out
    • Asymptotes at 0 and 1
  • Given this we need to fit θ to our data (a minimal code sketch of g and hθ follows)
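A minimal sketch of g and hθ in NumPy (the names sigmoid and h are mine):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^-z): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # h_theta(x) = g(theta^T x)
    return sigmoid(theta @ x)

print(sigmoid(0.0))    # 0.5      -- the curve crosses 0.5 at z = 0
print(sigmoid(10.0))   # ~0.99995 -- asymptotes toward 1
print(sigmoid(-10.0))  # ~0.00005 -- asymptotes toward 0

theta = np.array([0.0, 1.0])  # hypothetical parameters, not yet fitted
x = np.array([1.0, 2.0])      # x0 = 1 as always
print(h(theta, x))            # sigmoid(2.0), roughly 0.88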
Interpreting hypothesis output
  • When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
    • Example
      • If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
      • hθ(x) = 0.7
        • Tells a patient they have a 70% chance of a tumor being malignant
    • We can write this using the following notation
      • hθ(x) = P(y=1|x ; θ)
    • What does this mean?
      • Probability that y=1, given x, parameterized by θ
  • Since this is a binary classification task we know y = 0 or 1
    • So the following must be true (see the worked example after this list)
      • P(y=1|x ; θ) + P(y=0|x ; θ) = 1
      • P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
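A worked version of the tumour example, with θ values made up so that hθ(x) lands near 0.7:

import numpy as np

# Made-up parameters and input: x0 = 1 (intercept), x1 = tumour size
theta = np.array([-3.0, 0.055])
x = np.array([1.0, 70.0])

p_malignant = 1.0 / (1.0 + np.exp(-(theta @ x)))  # P(y=1 | x; theta)
print(p_malignant)        # ~0.70 -- "70% chance the tumour is malignant"
print(1.0 - p_malignant)  # ~0.30 -- P(y=0 | x; theta) = 1 - P(y=1 | x; theta)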

Decision boundary
  • Gives a better sense of what the hypothesis function is computing and what it looks like
    • One way of using the sigmoid function is:
      • When the probability of y being 1 is greater than 0.5 then we can predict y = 1
      • Else we predict y = 0
    • When is it exactly that hθ(x) is greater than 0.5?
      • Look at sigmoid function
        • g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
      • So if z is positive, g(z) is greater than 0.5
        • z = θ^T x
      • So when
        • θ^T x >= 0
      • Then hθ(x) >= 0.5
  • So what we've shown is that the hypothesis predicts y = 1 when θ^T x >= 0
    • The corollary is that when θ^T x < 0 the hypothesis predicts y = 0
    • Let's use this to better understand how the hypothesis makes its predictions (the sketch below turns this rule into code)
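A sketch of that shortcut: the 0/1 prediction only needs the sign of θ^T x, so the sigmoid never has to be evaluated (helper names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # theta^T x >= 0  <=>  g(theta^T x) >= 0.5
    return int(theta @ x >= 0.0)

theta = np.array([-1.0, 2.0])  # hypothetical parameters
for x1 in [0.0, 0.5, 1.0]:
    x = np.array([1.0, x1])    # x0 = 1
    print(x1, round(sigmoid(theta @ x), 3), predict(theta, x))
    # prints: 0.0 0.269 0 / 0.5 0.5 1 / 1.0 0.731 1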
Decision boundary
  • hθ(x) = g(θ0 + θ1x1 + θ2x2)



  • So, for example
    • θ0 = -3
    • θ1 = 1
    • θ2 = 1
  • So our parameter vector is a column vector with the above values
    • So, θ^T is a row vector = [-3,1,1]
  • What does this mean?
    • The z here becomes θ^T x
    • We predict "y = 1" if
      • -3x0 + 1x1 + 1x2 >= 0
      • -3 + x1 + x2 >= 0 (since x0 = 1; the sketch after this list evaluates this boundary at a few points)
  • We can also re-write this as
    • If (x1 + x2 >= 3) then we predict y = 1
    • If we plot
      • x1 + x2 = 3 we graphically plot our decision boundary

  • Means we have these two regions on the graph
    • Blue = false
    • Magenta = true
    • Line = decision boundary
      • Concretely, the straight line is the set of points where hθ(x) = 0.5 exactly
    • The decision boundary is a property of the hypothesis
      • Means we can create the boundary with the hypothesis and parameters without any data
        • Later, we use the data to determine the parameter values
      • e.g. with θ0 = 5, θ1 = -1, y = 1 if
        • 5 - x1 > 0
        • i.e. x1 < 5
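Evaluating the lecture's θ = [-3, 1, 1] boundary at a few points (the predict helper is mine):

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # the example parameters above

def predict(x1, x2):
    # y = 1 exactly when -3 + x1 + x2 >= 0, i.e. when x1 + x2 >= 3
    return int(theta @ np.array([1.0, x1, x2]) >= 0.0)

print(predict(1.0, 1.0))  # 0 -- below the line x1 + x2 = 3 (the "false" region)
print(predict(2.0, 2.0))  # 1 -- above the line (the "true" region)
print(predict(1.0, 2.0))  # 1 -- exactly on the boundary, where hθ(x) = 0.5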
Non-linear decision boundaries
  • Get logistic regression to fit a complex non-linear data set
    • As with polynomial regression, add higher order terms
    • So say we have
      • hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2)
      • We take the transpose of the θ vector times the input vector
        • Say θ^T was [-1,0,0,1,1], then we say:
        • Predict that "y = 1" if
          • -1 + x1^2 + x2^2 >= 0
            or
          • x1^2 + x2^2 >= 1
        • If we plot x1^2 + x2^2 = 1
          • This gives us a circle of radius 1 centred at the origin
  • Means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
  • More complex decision boundaries?
    • By using higher order polynomial terms, we can get even more complex decision boundaries (the sketch below checks the unit-circle boundary at a few points)
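The same check for the circular boundary with θ = [-1, 0, 0, 1, 1] (the feature construction and predict helper are mine):

import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # the example parameters above

def predict(x1, x2):
    # Feature vector with the higher-order terms x1^2 and x2^2 appended
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return int(theta @ features >= 0.0)  # y = 1 when x1^2 + x2^2 >= 1

print(predict(0.0, 0.0))  # 0 -- inside the unit circle
print(predict(2.0, 0.0))  # 1 -- outside the circle
print(predict(0.6, 0.8))  # 1 -- on the boundary (0.36 + 0.64 = 1)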

<source: http://www.holehouse.org, Coursera Machine learning class of Andrew Ng>
