Showing posts with the label regression.

Monday, April 9, 2018

Assumptions of regression modeling

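  • The residuals, i.e. the differences between the observed values and the model's estimates, must be i.i.d.
    • i.i.d. (independent and identically distributed random variables)
      • The residuals should follow a normal distribution
        • How to check
          • Histogram of the residuals
          • Shapiro-Wilk normality test
            • A high p-value means normality is not rejected, so the residuals are treated as normally distributed.
          • Normal quantile plot
          • Check skewness and kurtosis
            • Skewness (third standardized moment): how asymmetric the distribution is
            • Kurtosis (fourth standardized moment): how heavy the tails of the distribution are
          • Jarque-Bera test
      • The residuals must be independent
        • The residuals should be uncorrelated with the independent variables and with themselves (no autocorrelation).
        • How to check
          • Correlation between the independent variables and the residuals
          • Scatter plot
      • The spread of the residuals should be constant (homoscedasticity of the residuals).
  • Source: https://brunch.co.kr/@gimmesilver/17. I read it and noted only the keywords.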
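The checks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method of the cited post: the data, the function names, and the use of the asymptotic chi-square p-value for the Jarque-Bera statistic are all assumptions made for the example, and that p-value is unreliable for a sample this small.

```python
import math

def standardized_moments(values):
    """Return (skewness, kurtosis): the third and fourth standardized moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def jarque_bera(values):
    """Jarque-Bera statistic with its asymptotic p-value (chi-square, 2 d.o.f.)."""
    skew, kurt = standardized_moments(values)
    jb = len(values) / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # For a chi-square distribution with 2 d.o.f. the survival function is exp(-x/2).
    return jb, math.exp(-jb / 2.0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictor values and residuals from some fitted model.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
residuals = [0.3, -0.5, 0.1, 0.4, -0.2, -0.1, 0.6, -0.4, 0.0, -0.2]

skew, kurt = standardized_moments(residuals)
jb, p_value = jarque_bera(residuals)
print("skewness:", skew, "kurtosis:", kurt)          # a normal distribution has 0 and 3
print("Jarque-Bera:", jb, "p-value:", p_value)       # high p-value: normality not rejected
print("corr(x, residuals):", pearson(x, residuals))  # should be close to 0
print("lag-1 autocorrelation:", pearson(residuals[:-1], residuals[1:]))
```

A histogram, a normal quantile plot, and a scatter plot of residuals against each predictor remain the visual counterparts of these numbers.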

Monday, May 9, 2016

Determinants of house prices: quantile regression - a quantile analysis method that accounts for hedonic characteristics

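Hedonic characteristics: even when housing units have the same physical characteristics, they do not sell at the same price, and the determinants of price differ from one unit to another.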

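Quantile analysis is used to estimate separately, for each house-price quantile, the effect of a house's hedonic characteristics on its price.
Because quantile regression allows different hedonic effects at each house-price quantile, it can address the problem of heteroscedasticity.
Hedonic characteristics are grouped into the physical attributes of the house, the characteristics of the neighborhood and environment, and the location and accessibility of the house.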

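The effect of hedonic characteristics is expected to differ between low-price and high-price quantiles, so house prices are divided into the 5%, 10%, 25%, 50%, 75%, 90%, and 95% quantiles to examine the relationship between hedonic characteristics and house prices.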
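What "estimating at a quantile" means can be made concrete with the check (pinball) loss, the standard objective that quantile regression minimizes. The sketch below fits only a constant to made-up prices, which is an assumption for illustration and not the model of the cited paper: minimizing the loss at level tau recovers the tau-quantile of the sample, so an extreme price moves the upper quantile but not the median.

```python
def pinball_loss(y, q, tau):
    """Check loss of predicting q for an observation y at quantile level tau."""
    u = y - q
    return tau * u if u >= 0 else (tau - 1.0) * u

def best_constant(ys, tau):
    """Return the observed value that minimizes the total pinball loss."""
    return min(ys, key=lambda q: sum(pinball_loss(y, q, tau) for y in ys))

# Made-up prices with one very expensive unit.
prices = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(best_constant(prices, 0.5))  # the median of the sample
print(best_constant(prices, 0.9))  # the upper quantile is pulled to the expensive unit
```

In a full quantile regression the constant q is replaced by a linear function of the hedonic variables, and the coefficients are estimated separately for each tau.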

The Watson statistic, which tests whether a variable is normally distributed, shows that the data are not normally distributed, so quantile regression is needed.


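Among the determinants of apartment prices, the hedonic variables used were the age of the building, floor area for exclusive use (m²), total number of floors, floor of residence, a south-facing dummy, a subway dummy, a high school dummy, and a view dummy for units with a view of mountains or a river. Following the AIC criterion, squared terms for building age and floor area were also included, so a nonlinear model was estimated. The south-facing dummy is 1 if the unit faces south and 0 otherwise; the subway dummy is 1 if a subway station is within a 10-minute walk and 0 otherwise; the high school dummy is 1 if a high school is within a 10-minute walk and 0 otherwise.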
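The variable construction described above could be encoded roughly as follows. The function name, argument names, and dictionary keys are hypothetical, and the input values are invented; only the dummy rules (1 within a 10-minute walk, 1 if south-facing) and the two squared terms come from the text.

```python
def hedonic_features(age_years, area_m2, total_floors, floor,
                     faces_south, subway_walk_min, school_walk_min, has_view):
    """Build one row of regressors from raw apartment attributes."""
    return {
        "age": age_years,
        "age_sq": age_years ** 2,      # squared term added for nonlinearity
        "area": area_m2,
        "area_sq": area_m2 ** 2,       # squared term added for nonlinearity
        "total_floors": total_floors,
        "floor": floor,
        "south": 1 if faces_south else 0,
        "subway": 1 if subway_walk_min <= 10 else 0,
        "high_school": 1 if school_walk_min <= 10 else 0,
        "view": 1 if has_view else 0,
    }

# An invented apartment, for illustration only.
row = hedonic_features(age_years=12, area_m2=84.0, total_floors=20, floor=7,
                       faces_south=True, subway_walk_min=8,
                       school_walk_min=15, has_view=False)
print(row)
```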

Source: 서울 주택가격의 결정요인: 분위수 회귀분석 (Determinants of House Prices in Seoul: Quantile Regression Approach)

Thursday, March 31, 2016

Logistic Regression (2)

Cost function for logistic regression
  • Fit θ parameters
  • Define the optimization objective for the cost function we use to fit the parameters
    • Training set of m training examples
      • Each example is an (n+1)-length column vector
  • This is the situation
    • Set of m training examples
    • Each example is a feature vector which is n+1 dimensional
    • x0 = 1
    • y ∈ {0,1}
    • Hypothesis is based on parameters (θ)
      • Given the training set, how do we choose/fit θ?
  • Linear regression uses the following function to determine θ
    • J(θ) = (1/m) Σ_{i=1..m} 1/2 (hθ(x^(i)) - y^(i))^2
  • Instead of writing the squared error term, we can write
    • If we define "cost()" as;
      • cost(hθ(x^(i)), y^(i)) = 1/2 (hθ(x^(i)) - y^(i))^2
      • Which evaluates to the cost for an individual example using the same measure as used in linear regression
    • We can redefine J(θ) as
      • J(θ) = (1/m) Σ_{i=1..m} cost(hθ(x^(i)), y^(i))
      • Which, appropriately, is the average of the individual costs over the training data (i.e. the same as linear regression)
  • To further simplify it we can get rid of the superscripts
    • So
      • cost(hθ(x), y) = 1/2 (hθ(x) - y)^2
  • What does this actually mean?
    • This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
    • If we use this function for logistic regression this is a non-convex function for parameter optimization
      • Could work....
  • What do we mean by non convex?
    • We have some function - J(θ) - for determining the parameters
    • Our hypothesis function has a non-linearity (hθ(x) is a sigmoid function of θ^T x)
      • This is a complicated non-linear function
    • If you take hθ(x) and plug it into the Cost() function, and then plug the Cost() function into J(θ) and plot J(θ), we find many local optima -> non-convex function
    • Why is this a problem
      • Lots of local minima mean gradient descent may not find the global optimum - it may get stuck in a local minimum
    • We would like a convex function so if you run gradient descent you converge to a global minimum
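The non-convexity claim can be checked numerically. The sketch below is a one-parameter illustration with a single made-up training example (x = 1, y = 1), not a plot of the full J(θ): it tests the midpoint inequality f((a+b)/2) <= (f(a)+f(b))/2, which every convex function satisfies, for the squared-error cost and for the log cost that the notes go on to introduce. All function names are assumptions for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_cost(theta, x=1.0, y=1.0):
    """Squared-error cost of one example when the hypothesis is a sigmoid."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

def log_cost(theta, x=1.0, y=1.0):
    """Log cost of one example: -log(h) if y = 1, -log(1 - h) if y = 0."""
    h = sigmoid(theta * x)
    return -math.log(h) if y == 1.0 else -math.log(1.0 - h)

def violates_midpoint_convexity(f, a, b):
    """True if f at the midpoint lies above the chord, which rules out convexity."""
    return f((a + b) / 2.0) > (f(a) + f(b)) / 2.0

print(violates_midpoint_convexity(squared_cost, -10.0, -2.0))  # True: not convex
print(violates_midpoint_convexity(log_cost, -10.0, -2.0))      # False on this pair
```

A single violation is enough to show the squared-error cost is not convex in θ; passing the check on one pair does not by itself prove convexity of the log cost, which follows from its derivation instead.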
A convex logistic regression cost function
  • To get around this we need a different, convex Cost() function which means we can apply gradient descent
  • This is our logistic regression cost function
    • cost(hθ(x), y) = -log(hθ(x)) if y = 1
    • cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0
    • This is the penalty the algorithm pays
    • Plot the function
  • Plot y = 1
    • So the cost evaluates as -log(hθ(x))

  • So when we're right, cost function is 0
    • Else it slowly increases cost function as we become "more" wrong
    • X axis is what we predict
    • Y axis is the cost associated with that prediction
  • This cost function has some interesting properties
    • If y = 1 and hθ(x) = 1
      • If the hypothesis predicts exactly 1 and that's exactly correct, then the cost is 0 (exactly, not nearly 0)
    • As hθ(x) goes to 0
      • Cost goes to infinity
      • This captures the intuition that if hθ(x) = 0 (predict P(y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
  • What about if y = 0
  • then cost is evaluated as -log(1 - hθ(x))
    • Just the mirror image of the other function
  • Now it goes to plus infinity as hθ(x) goes to 1
  • With this particular cost function, J(θ) is convex, so there are no local minima other than the global one
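The behaviour described in this section is easy to verify with a few values. The following is a minimal sketch of the two-case cost; the function name and the probe values are chosen for illustration, and h must lie strictly between 0 and 1 on the side where the logarithm is taken, since math.log(0) raises an error.

```python
import math

def cost(h, y):
    """Cost of predicting probability h when the true label y is 0 or 1."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# y = 1: zero cost for a confident correct prediction, growing cost as h falls.
for h in (1.0, 0.9, 0.5, 0.1, 1e-9):
    print("y=1, h=%g -> cost %.4f" % (h, cost(h, 1)))

# y = 0: the mirror image, growing without bound as h approaches 1.
for h in (0.0, 0.1, 0.5, 0.9):
    print("y=0, h=%g -> cost %.4f" % (h, cost(h, 0)))
```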
Simplified cost function and gradient descent
  • Define a simpler way to write the cost function and apply gradient descent to the logistic regression
    • By the end should be able to implement a fully functional logistic regression function
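  • Logistic regression cost function is as follows
    • J(θ) = (1/m) Σ_{i=1..m} cost(hθ(x^(i)), y^(i))
    • cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0
  • cost() is the cost for a single example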
    • For binary classification problems y is always 0 or 1
      • Because of this, we can have a simpler way to write the cost function
        • Rather than writing cost function on two lines/two cases
        • Can compress them into one equation - more efficient
    • Can write the cost function as
      • cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
        • This equation is a more compact form of the two cases above
    • We know that there are only two possible cases
      • y = 1
        • Then our equation simplifies to
          • -log(hθ(x)) - (0)log(1 - hθ(x))
            • -log(hθ(x))
            • Which is what we had before when y = 1
      • y = 0
        • Then our equation simplifies to
          • -(0)log(hθ(x)) - (1)log(1 - hθ(x))
          • = -log(1- hθ(x))
          • Which is what we had before when y = 0
      • Clever!
  • So, in summary, our cost function for the θ parameters can be defined as
    • J(θ) = -(1/m) Σ_{i=1..m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
  • Why do we choose this function when other cost functions exist?
    • This cost function can be derived from statistics using the principle of maximum likelihood estimation
      • Note the underlying assumption is that y given x follows a Bernoulli distribution with P(y=1|x; θ) = hθ(x); it does not require the features to be Gaussian
    • Also has the nice property that it's convex
  • To fit parameters θ:
    • Find parameters θ which minimize J(θ)
    • This means we have a set of parameters to use in our model for future predictions
  • Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using hθ(x) = 1/(1 + e^(-θ^T x))
    • This result is
      • p(y=1 | x ; θ)
        • Probability y = 1, given x, parameterized by θ
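The compact cost and the resulting J(θ) can be written down directly. This is a sketch under assumptions: plain Python lists stand in for vectors, the tiny data set is invented, and the names hypothesis, compact_cost, and J are chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def compact_cost(h, y):
    """Single-equation form of the two-case cost."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

def J(theta, X, Y):
    """Average cost over the m training examples."""
    m = len(X)
    return sum(compact_cost(hypothesis(theta, x), y) for x, y in zip(X, Y)) / m

# Invented training set; the first component of each x is x0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
Y = [0, 0, 1, 1]

print(J([0.0, 0.0], X, Y))            # with theta = 0 every h is 0.5, so J = log(2)
print(J([-4.0, 2.0], X, Y))           # a theta that separates the classes costs less
print(hypothesis([-4.0, 2.0], [1.0, 3.0]))  # estimated P(y = 1) for a new example
```

With y fixed at 1 or 0, compact_cost reduces to -log(h) or -log(1 - h), matching the two cases worked through above.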
How to minimize the logistic regression cost function
  • Now we need to figure out how to minimize J(θ)
    • Use gradient descent as before
    • Repeatedly update each parameter using a learning rate α
      • θj := θj - α (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) xj^(i)
  • If you had n features, you would have an (n+1)-element column vector for θ
  • This equation is the same as the linear regression rule
    • The only difference is that our definition for the hypothesis has changed
  • Previously, we spoke about how to monitor gradient descent to check it's working
    • Can do the same thing here for logistic regression
  • When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously
    • Could use a for loop
    • Better would be a vectorized implementation
  • Feature scaling for gradient descent for logistic regression also applies here
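Putting the pieces together, a small batch gradient descent loop might look like the sketch below. It is not the course's reference implementation: it uses plain Python loops rather than a vectorized form, and the data set, learning rate, and iteration count are assumptions picked so the example runs quickly. The cost is recorded at every iteration so convergence can be monitored, and all θj are updated simultaneously from the same errors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost_J(theta, X, Y):
    total = 0.0
    for x, y in zip(X, Y):
        h = predict_proba(theta, x)
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / len(X)

def gradient_step(theta, X, Y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(X)
    # Errors are computed once from the old theta, then reused for every j.
    errors = [predict_proba(theta, x) - y for x, y in zip(X, Y)]
    grads = [sum(e * x[j] for e, x in zip(errors, X)) / m
             for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grads)]

def fit(X, Y, alpha=0.1, iterations=5000):
    theta = [0.0] * len(X[0])
    history = []
    for _ in range(iterations):
        history.append(cost_J(theta, X, Y))
        theta = gradient_step(theta, X, Y, alpha)
    return theta, history

# Invented, non-separable data; x0 = 1 is the intercept term.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
Y = [0, 0, 0, 1, 0, 1, 1, 1]

theta, history = fit(X, Y)
print("theta:", theta)
print("cost: first %.4f, last %.4f" % (history[0], history[-1]))
print("P(y=1 | x1=0.5):", predict_proba(theta, [1.0, 0.5]))
print("P(y=1 | x1=4.0):", predict_proba(theta, [1.0, 4.0]))
```

If the recorded cost ever increases from one iteration to the next, the learning rate is too large for the data at hand.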


<source:http://www.holehouse.org/mlclass/06_Logistic_Regression.html, Andrew Ng's Coursera Lectures>

Tuesday, March 22, 2016

Logistic Regression (1)

Classification
  • Where y is a discrete value
    • Develop the logistic regression algorithm to determine what class a new input should fall into
  • Classification problems
    • Email -> spam/not spam?
    • Online transactions -> fraudulent?
    • Tumor -> Malignant/benign
  • Variable in these problems is Y
    • Y is either 0 or 1
      • 0 = negative class (absence of something)
      • 1 = positive class (presence of something)
  • Start with binary class problems
    • Later look at multiclass classification problem, although this is just an extension of binary classification
  • How do we develop a classification algorithm?
    • Tumour size vs malignancy (0 or 1)
    • We could use linear regression
      • Then threshold the classifier output (i.e. anything over some value is yes, else no)
      • In our example below linear regression with thresholding seems to work
  • We can see above this does a reasonable job of stratifying the data points into one of two classes
    • But what if we had a single Yes with a very large tumour
    • This would shift the fitted line and lead to classifying some of the existing yeses as nos
  • Another issue with linear regression
    • We know Y is 0 or 1
    • Hypothesis can give values larger than 1 or less than 0
  • So, logistic regression generates a value which is always between 0 and 1
    • Logistic regression is a classification algorithm, despite the word "regression" in its name - don't be confused
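Both problems with thresholded linear regression show up in a small experiment. The sketch below fits an ordinary least-squares line to invented tumour sizes, thresholds it at 0.5, and then refits after adding one extra positive example far from the rest (here a very large tumour, an assumed value). The function names and all numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def classify(x, intercept, slope, threshold=0.5):
    """Predict 1 when the fitted line is at or above the threshold."""
    return 1 if intercept + slope * x >= threshold else 0

sizes = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_line(sizes, labels)
print([classify(x, b0, b1) for x in sizes])   # matches the labels
print(b0 + b1 * 8)                            # the line's output exceeds 1 here

# Add one malignant tumour much larger than the others and refit.
c0, c1 = fit_line(sizes + [40], labels + [1])
print([classify(x, c0, c1) for x in sizes])   # size 5 is now predicted 0
```

The extra example is classified correctly either way, yet it flattens the line enough to move the 0.5 crossing and misclassify a point that was previously handled correctly.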
Hypothesis representation
  • What function is used to represent our hypothesis in classification
  • We want our classifier to output values between 0 and 1
    • When using linear regression we did hθ(x) = θ^T x
    • For classification hypothesis representation we do hθ(x) = g(θ^T x)
      • Where we define g(z)
        • z is a real number
      • g(z) = 1/(1 + e^(-z))
        • This is the sigmoid function, or the logistic function
      • If we combine these equations we can write out the hypothesis as
        • hθ(x) = 1/(1 + e^(-θ^T x))
  • What does the sigmoid function look like
  • Crosses 0.5 at z = 0, then flattens out
    • Asymptotes at 0 and 1
  • Given this we need to fit θ to our data
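The shape of the sigmoid can be confirmed by evaluating it at a few points. The version below is a sketch that branches on the sign of z so that math.exp never receives a large positive argument; that branching is an implementation choice made here, not something the notes prescribe.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # z < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)

for z in (-10, -2, 0, 2, 10):
    print("g(%d) = %.6f" % (z, sigmoid(z)))
```

The printed values rise from near 0 to near 1 and pass through exactly 0.5 at z = 0, and g(z) + g(-z) = 1 for every z.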
Interpreting hypothesis output
  • When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
    • Example
      • If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
      • hθ(x) = 0.7
        • Tells a patient they have a 70% chance of a tumor being malignant
    • We can write this using the following notation
      • hθ(x) = P(y=1|x ; θ)
    • What does this mean?
      • Probability that y=1, given x, parameterized by θ
  • Since this is a binary classification task we know y = 0 or 1
    • So the following must be true
      • P(y=1|x ; θ) + P(y=0|x ; θ) = 1
      • P(y=0|x ; θ) = 1 - P(y=1|x ; θ)

Decision boundary
  • Gives a better sense of what the hypothesis function is computing
  • Better understanding of what the hypothesis function looks like
    • One way of using the sigmoid function is;
      • When the probability of y being 1 is greater than 0.5 then we can predict y = 1
      • Else we predict y = 0
    • When is it exactly that hθ(x) is greater than 0.5?
      • Look at sigmoid function
        • g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
      • So if z is positive, g(z) is greater than 0.5
        • z = θ^T x
      • So when
        • θ^T x >= 0
      • Then hθ(x) >= 0.5
  • So what we've shown is that the hypothesis predicts y = 1 when θ^T x >= 0
    • The corollary is that when θ^T x < 0 the hypothesis predicts y = 0
    • Let's use this to better understand how the hypothesis makes its predictions
Decision boundary
  • hθ(x) = g(θ0 + θ1x1 + θ2x2)



  • So, for example
    • θ0 = -3
    • θ1 = 1
    • θ2 = 1
  • So our parameter vector is a column vector with the above values
    • So, θ^T is a row vector = [-3,1,1]
  • What does this mean?
    • The z here becomes θ^T x
    • We predict "y = 1" if
      • -3x0 + 1x1 + 1x2 >= 0
      • -3 + x1 + x2 >= 0
  • We can also re-write this as
    • If (x1 + x2 >= 3) then we predict y = 1
    • If we plot
      • x1 + x2 = 3 we graphically plot our decision boundary

  • Means we have these two regions on the graph
    • Blue = false
    • Magenta = true
    • Line = decision boundary
      • Concretely, the straight line is the set of points where hθ(x) = 0.5 exactly
    • The decision boundary is a property of the hypothesis
      • Means we can create the boundary with the hypothesis and parameters without any data
        • Later, we use the data to determine the parameter values
      • As another example, if θ^T x = 5 - x1, then y = 1 if
        • 5 - x1 > 0
        • 5 > x1
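The linear decision boundary from the worked example, θ = [-3, 1, 1], can be checked point by point. In this sketch the prediction uses the sign of θ^T x directly, which is equivalent to thresholding the sigmoid at 0.5; the function name and the test points are assumptions for illustration.

```python
def predict(theta, x):
    """Predict y = 1 when theta^T x >= 0, i.e. when h_theta(x) >= 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]   # boundary: x1 + x2 = 3

# Each point is [x0, x1, x2] with x0 = 1.
print(predict(theta, [1.0, 1.0, 1.0]))   # x1 + x2 = 2, below the line
print(predict(theta, [1.0, 2.0, 2.0]))   # x1 + x2 = 4, above the line
print(predict(theta, [1.0, 1.5, 1.5]))   # exactly on the boundary
```

No training data appears anywhere in the snippet: the boundary is fixed entirely by θ.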
Non-linear decision boundaries
  • Get logistic regression to fit a complex non-linear data set
    • As in polynomial regression, add higher-order terms
    • So say we have
      • hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2)
      • We take the transpose of the θ vector times the input vector
        • Say θ^T was [-1,0,0,1,1] then we say;
        • Predict that "y = 1" if
          • -1 + x1^2 + x2^2 >= 0
            or
          • x1^2 + x2^2 >= 1
        • If we plot x1^2 + x2^2 = 1
          • This gives us a circle with a radius of 1 centered on the origin
  • This means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
  • More complex decision boundaries?
    • By using higher order polynomial terms, we can get even more complex decision boundaries
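The circular boundary follows from the same linear rule applied to expanded features. The sketch below maps (x1, x2) to [1, x1, x2, x1^2, x2^2] and uses θ = [-1, 0, 0, 1, 1] from the example; the helper names and the probe points are assumptions.

```python
def polynomial_features(x1, x2):
    """Expand two raw features with their squares, plus x0 = 1."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

def predict(theta, features):
    """Predict y = 1 when theta^T x >= 0."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

theta = [-1.0, 0.0, 0.0, 1.0, 1.0]   # boundary: x1^2 + x2^2 = 1

print(predict(theta, polynomial_features(0.0, 0.0)))   # inside the unit circle
print(predict(theta, polynomial_features(0.5, 0.5)))   # still inside
print(predict(theta, polynomial_features(1.0, 1.0)))   # outside
print(predict(theta, polynomial_features(1.0, 0.0)))   # on the circle
```

The classifier is still linear in θ; only the feature map is non-linear, which is why adding higher-order terms yields more complex boundaries.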

<source: http://www.holehouse.org, Coursera Machine learning class of Andrew Ng>