Tuesday, March 22, 2016

Logistic Regression (1)

Classification
  • Where y is a discrete value
    • Develop the logistic regression algorithm to determine what class a new input should fall into
  • Classification problems
    • Email -> spam/not spam?
    • Online transactions -> fraudulent?
    • Tumor -> Malignant/benign?
  • The variable to predict in these problems is y
    • y is either 0 or 1
      • 0 = negative class (absence of something)
      • 1 = positive class (presence of something)
  • Start with binary class problems
    • Later look at multiclass classification problem, although this is just an extension of binary classification
  • How do we develop a classification algorithm?
    • Tumour size vs malignancy (0 or 1)
    • We could use linear regression
      • Then threshold the classifier output (i.e. anything over some value is yes, else no)
      • In the lecture's example (figure not reproduced here) linear regression with thresholding seems to work
  • That does a reasonable job of stratifying the data points into one of two classes
    • But what if we then saw one extra Yes example with a very large tumour, far to the right?
    • Refitting would flatten the line and shift the 0.5 crossing point, so some of the existing yeses would be classified as nos (see the numeric sketch after this list)
  • Another issue with linear regression
    • We know y is 0 or 1
    • But the hypothesis can give values larger than 1 or less than 0
  • So instead we use logistic regression, which generates a value that is always between 0 and 1
    • Logistic regression is a classification algorithm, despite the name - don't be confused
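To make the failure mode concrete, here is a minimal numeric sketch of fit-then-threshold linear regression. The tumour sizes, labels, and the fit_and_threshold helper are invented for illustration; they are not the lecture's data:

import numpy as np

# Hypothetical tumour sizes and labels (0 = benign, 1 = malignant)
sizes = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 1, 1, 1])

def fit_and_threshold(x, y, cutoff=0.5):
    # Least-squares line y = a*x + b, then predict 1 where it crosses the cutoff
    a, b = np.polyfit(x, y, 1)
    return (a * x + b >= cutoff).astype(int)

print(fit_and_threshold(sizes, labels))    # [0 0 0 1 1 1] -- matches the labels

# One extra malignant example far to the right flattens the refit line,
# moves the 0.5 crossing point right, and turns the yes at size 6 into a no
sizes2 = np.append(sizes, 50.0)
labels2 = np.append(labels, 1)
print(fit_and_threshold(sizes2, labels2))  # [0 0 0 0 1 1 1] -- size 6 misclassified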
Hypothesis representation
  • What function is used to represent our hypothesis in classification?
  • We want our classifier to output values between 0 and 1
    • When using linear regression we did hθ(x) = θ^T x
    • For classification hypothesis representation we do hθ(x) = g(θ^T x)
      • Where we define g(z)
        • z is a real number
      • g(z) = 1 / (1 + e^-z)
        • This is the sigmoid function, or the logistic function
      • If we combine these equations we can write out the hypothesis as
        • hθ(x) = 1 / (1 + e^-(θ^T x))
  • What does the sigmoid function look like?
  • Crosses 0.5 at z = 0, then flattens out
    • Asymptotes at 0 and 1
  • Given this we need to fit θ to our data (a minimal code sketch of g and hθ follows)
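A minimal sketch of g and hθ in NumPy (the names sigmoid and h are mine):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^-z): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # h_theta(x) = g(theta^T x)
    return sigmoid(theta @ x)

print(sigmoid(0.0))    # 0.5      -- the curve crosses 0.5 at z = 0
print(sigmoid(10.0))   # ~0.99995 -- asymptotes toward 1
print(sigmoid(-10.0))  # ~0.00005 -- asymptotes toward 0

theta = np.array([0.0, 1.0])  # hypothetical parameters, not yet fitted
x = np.array([1.0, 2.0])      # x0 = 1 as always
print(h(theta, x))            # sigmoid(2.0), roughly 0.88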
Interpreting hypothesis output
  • When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
    • Example
      • If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
      • hθ(x) = 0.7
        • Tells a patient they have a 70% chance of a tumor being malignant
    • We can write this using the following notation
      • hθ(x) = P(y=1|x ; θ)
    • What does this mean?
      • Probability that y=1, given x, parameterized by θ
  • Since this is a binary classification task we know y = 0 or 1
    • So the following must be true (see the worked example after this list)
      • P(y=1|x ; θ) + P(y=0|x ; θ) = 1
      • P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
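A worked version of the tumour example, with θ values made up so that hθ(x) lands near 0.7:

import numpy as np

# Made-up parameters and input: x0 = 1 (intercept), x1 = tumour size
theta = np.array([-3.0, 0.055])
x = np.array([1.0, 70.0])

p_malignant = 1.0 / (1.0 + np.exp(-(theta @ x)))  # P(y=1 | x; theta)
print(p_malignant)        # ~0.70 -- "70% chance the tumour is malignant"
print(1.0 - p_malignant)  # ~0.30 -- P(y=0 | x; theta) = 1 - P(y=1 | x; theta)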

Decision boundary
  • Gives a better sense of what the hypothesis function is computing and what it looks like
    • One way of using the sigmoid function is:
      • When the probability of y being 1 is greater than 0.5 then we can predict y = 1
      • Else we predict y = 0
    • When is it exactly that hθ(x) is greater than 0.5?
      • Look at sigmoid function
        • g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
      • So if z is positive, g(z) is greater than 0.5
        • z = θ^T x
      • So when
        • θ^T x >= 0
      • Then hθ(x) >= 0.5
  • So what we've shown is that the hypothesis predicts y = 1 when θ^T x >= 0
    • The corollary is that when θ^T x < 0 the hypothesis predicts y = 0
    • Let's use this to better understand how the hypothesis makes its predictions (the sketch below turns this rule into code)
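A sketch of that shortcut: the 0/1 prediction only needs the sign of θ^T x, so the sigmoid never has to be evaluated (helper names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # theta^T x >= 0  <=>  g(theta^T x) >= 0.5
    return int(theta @ x >= 0.0)

theta = np.array([-1.0, 2.0])  # hypothetical parameters
for x1 in [0.0, 0.5, 1.0]:
    x = np.array([1.0, x1])    # x0 = 1
    print(x1, round(sigmoid(theta @ x), 3), predict(theta, x))
    # prints: 0.0 0.269 0 / 0.5 0.5 1 / 1.0 0.731 1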
Decision boundary
  • hθ(x) = g(θ0 + θ1x1 + θ2x2)



  • So, for example
    • θ0 = -3
    • θ1 = 1
    • θ2 = 1
  • So our parameter vector is a column vector with the above values
    • So, θ^T is a row vector = [-3,1,1]
  • What does this mean?
    • The z here becomes θ^T x
    • We predict "y = 1" if
      • -3x0 + 1x1 + 1x2 >= 0
      • -3 + x1 + x2 >= 0 (since x0 = 1; the sketch after this list evaluates this boundary at a few points)
  • We can also re-write this as
    • If (x1 + x2 >= 3) then we predict y = 1
    • If we plot
      • x1 + x2 = 3 we graphically plot our decision boundary

  • Means we have these two regions on the graph
    • Blue = false
    • Magenta = true
    • Line = decision boundary
      • Concretely, the straight line is the set of points where hθ(x) = 0.5 exactly
    • The decision boundary is a property of the hypothesis
      • Means we can create the boundary with the hypothesis and parameters without any data
        • Later, we use the data to determine the parameter values
      • e.g. with θ0 = 5, θ1 = -1, y = 1 if
        • 5 - x1 > 0
        • i.e. x1 < 5
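Evaluating the lecture's θ = [-3, 1, 1] boundary at a few points (the predict helper is mine):

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # the example parameters above

def predict(x1, x2):
    # y = 1 exactly when -3 + x1 + x2 >= 0, i.e. when x1 + x2 >= 3
    return int(theta @ np.array([1.0, x1, x2]) >= 0.0)

print(predict(1.0, 1.0))  # 0 -- below the line x1 + x2 = 3 (the "false" region)
print(predict(2.0, 2.0))  # 1 -- above the line (the "true" region)
print(predict(1.0, 2.0))  # 1 -- exactly on the boundary, where hθ(x) = 0.5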
Non-linear decision boundaries
  • Get logistic regression to fit a complex non-linear data set
    • As with polynomial regression, add higher order terms
    • So say we have
      • hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2)
      • We take the transpose of the θ vector times the input vector
        • Say θ^T was [-1,0,0,1,1], then we say:
        • Predict that "y = 1" if
          • -1 + x1^2 + x2^2 >= 0
            or
          • x1^2 + x2^2 >= 1
        • If we plot x1^2 + x2^2 = 1
          • This gives us a circle of radius 1 centred at the origin
  • Means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
  • More complex decision boundaries?
    • By using higher order polynomial terms, we can get even more complex decision boundaries (the sketch below checks the unit-circle boundary at a few points)
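The same check for the circular boundary with θ = [-1, 0, 0, 1, 1] (the feature construction and predict helper are mine):

import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # the example parameters above

def predict(x1, x2):
    # Feature vector with the higher-order terms x1^2 and x2^2 appended
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return int(theta @ features >= 0.0)  # y = 1 when x1^2 + x2^2 >= 1

print(predict(0.0, 0.0))  # 0 -- inside the unit circle
print(predict(2.0, 0.0))  # 1 -- outside the circle
print(predict(0.6, 0.8))  # 1 -- on the boundary (0.36 + 0.64 = 1)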

<source: http://www.holehouse.org, Coursera Machine learning class of Andrew Ng>
