FREAKOMBINATION: regression


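Monday, April 9, 2018

Preconditions for regression modeling

The residuals, i.e. the differences between the observed values and the model's estimates, must be i.i.d.

i.i.d. (independent and identically distributed random variables)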

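The residuals follow a normal distribution

How to check

Histogram of the residuals
Shapiro-Wilk normality test

If the resulting p-value is high, normality is not rejected, i.e. the residuals are consistent with a normal distribution.

Normal quantile (Q-Q) plot
Check skewness and kurtosis

Skewness (third standardized moment): how asymmetric the distribution is
Kurtosis (fourth standardized moment): how fat (heavy-tailed) the distribution is

Jarque-Bera test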
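The moment-based checks above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the method of the cited article: the function names and the sample residuals are made up here, and the statistic uses the standard Jarque-Bera form n/6 * (S² + (K - 3)²/4), which is compared against a chi-squared distribution with 2 degrees of freedom when testing normality.

```python
def central_moment(values, order):
    """Central moment of the given order (population form, divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment; equals 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(values)
    s = skewness(values)
    k = kurtosis(values)
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)


# Hypothetical residuals, made up for illustration only.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(skewness(residuals), kurtosis(residuals), jarque_bera(residuals))
```

A large statistic suggests the residuals depart from normality; with only a handful of points, as here, the test has very little power.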

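The residuals must be independent

The residuals must be uncorrelated with the independent variables and also with themselves (no autocorrelation).
How to check

Correlation between the independent variables and the residuals
Scatter plot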
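As a rough sketch of the correlation check, the snippet below computes the Pearson correlation between a regressor and the residuals, and the lag-1 autocorrelation of the residuals. The helper names and the data are hypothetical; the alternating residuals are chosen so that the autocorrelation problem is obvious.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lag1_autocorrelation(values):
    """Correlation of the series with itself shifted by one step."""
    return pearson(values[:-1], values[1:])


# Hypothetical regressor and residuals, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

print(pearson(x, residuals))            # correlation with the regressor
print(lag1_autocorrelation(residuals))  # correlation with its own past
```

Residuals that flip sign at every step have a lag-1 autocorrelation close to -1, which violates independence even when their correlation with x is modest.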

The spread of the residuals must be constant (homoscedasticity of the residuals).

Source: https://brunch.co.kr/@gimmesilver/17 (read and summarized here as brief keywords only)

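Monday, May 9, 2016

Determinants of house prices: quantile regression - a quantile analysis method that accounts for hedonic characteristics

Hedonic characteristics: even when houses have the same physical characteristics, they do not trade at the same price, and the price determinants differ from one house to another.

Quantile analysis is used to estimate separately, for each house-price quantile, the effect of a house's hedonic characteristics on its price.

Because quantile regression can account for different hedonic characteristics at each house-price quantile, it can address the heteroscedasticity (heterogeneity) problem.

Hedonic characteristics are divided into the physical attributes of the house, the characteristics of its neighborhood and environment, and its location and accessibility.

The effect of the hedonic characteristics is expected to differ between low-price and high-price quantiles, so house prices are split into the 5%, 10%, 25%, 50%, 75%, 90% and 95% quantiles to examine the relationship between hedonic characteristics and house prices.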
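To make the idea of fitting a specific quantile concrete: quantile regression at level τ minimizes the check (pinball) loss ρ_τ(u) = u(τ - 1[u < 0]) over the residuals u. The sketch below is not taken from the paper; it uses made-up prices and an intercept-only model, where the minimizer is simply a sample quantile.

```python
def pinball_loss(residual, tau):
    """Check loss rho_tau(u) = u * (tau - 1[u < 0])."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def best_constant(prices, tau):
    """Observed price that minimizes the total pinball loss at level tau."""
    def total_loss(candidate):
        return sum(pinball_loss(p - candidate, tau) for p in prices)
    return min(prices, key=total_loss)


# Hypothetical prices (arbitrary units) with one extreme value.
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 100.0]

for tau in (0.25, 0.5, 0.9):
    print(tau, best_constant(prices, tau))
```

Replacing the constant with a linear function of the hedonic variables gives the regression version, with a separate coefficient vector for each τ.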

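The Watson statistic, which tests whether a variable is normally distributed, indicates that the data are not normally distributed, so quantile regression is needed.

Among the determinants of apartment prices, the hedonic factors used were the age of the building (years), the floor area for exclusive use (m²), the total number of floors, the floor of the unit, a south-facing dummy, a subway dummy, a high school dummy, and a dummy for a view of mountains or the river. Following the AIC criterion, squared terms for building age and floor area were added to estimate a non-linear model. The south-facing dummy is 1 if the unit faces south and 0 otherwise; the subway dummy is 1 if a subway station is within a 10-minute walk and 0 otherwise; the high school dummy is 1 if a high school is within a 10-minute walk and 0 otherwise.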
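The variable construction described above could look roughly like the following. The function name, argument names and the ordering of the columns are assumptions made for this sketch; only the dummy definitions and the two squared terms come from the summary.

```python
def design_row(age_years, area_m2, total_floors, floor, faces_south,
               subway_walk_min, high_school_walk_min, has_view):
    """Build one regressor row (hypothetical column order) for an apartment."""
    return [
        1.0,                                            # intercept
        float(age_years),
        float(age_years) ** 2,                          # squared building age
        float(area_m2),
        float(area_m2) ** 2,                            # squared floor area
        float(total_floors),
        float(floor),
        1.0 if faces_south else 0.0,                    # south-facing dummy
        1.0 if subway_walk_min <= 10 else 0.0,          # subway within a 10-minute walk
        1.0 if high_school_walk_min <= 10 else 0.0,     # high school within a 10-minute walk
        1.0 if has_view else 0.0,                       # mountain or river view dummy
    ]


# Made-up apartment, for illustration only.
row = design_row(10, 84.0, 15, 7, True, 5, 12, False)
print(row)
```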

Source: 서울 주택가격의 결정요인: 분위수 회귀분석

Determinants of House Prices in Seoul: Quantile Regression Approach

By 김희호 (Hee-Ho Kim) and 박세운 (Sae-Woon Park)

Thursday, March 31, 2016

Logistic Regression (2)

Cost function for logistic regression

Fit θ parameters
Define the optimization objective for the cost function we use to fit the parameters
- Training set of m training examples
  - Each example is an n+1 length column vector

This is the situation
- Set of m training examples
- Each example is a feature vector which is n+1 dimensional
- x₀ = 1
- y ∈ {0,1}
- Hypothesis is based on parameters (θ)
  - Given the training set, how do we choose/fit θ?
Linear regression uses the following function to determine θ
- J(θ) = (1/m) Σᵢ 1/2(h_θ(xⁱ) - yⁱ)², with the sum running over the m training examples

Instead of writing the squared error term, we can write
- If we define "cost()" as;
  - cost(h_θ(xⁱ), yⁱ) = 1/2(h_θ(xⁱ) - yⁱ)²
  - Which evaluates to the cost for an individual example using the same measure as used in linear regression
- We can redefine J(θ) as
  - J(θ) = (1/m) Σᵢ cost(h_θ(xⁱ), yⁱ)
  - Which, appropriately, is the sum of all the individual costs over the training data, scaled by 1/m (i.e. the same as linear regression)
To further simplify it we can get rid of the superscripts
- So
  - cost(h_θ(x), y) = 1/2(h_θ(x) - y)²

What does this actually mean?
- This is the cost you want the learning algorithm to pay if the outcome is h_θ(x) and the actual outcome is y
- If we use this function for logistic regression this is a non-convex function for parameter optimization
  - Gradient descent could still work, but it is not guaranteed to reach the global minimum
What do we mean by non convex?
- We have some function - J(θ) - for determining the parameters
- Our hypothesis function has a non-linearity (h_θ(x) is the sigmoid function applied to θ^T x)
  - This is a complicated non-linear function
- If you take h_θ(x) and plug it into the Cost() function, and then plug the Cost() function into J(θ) and plot J(θ) we find many local optima -> non convex function
- Why is this a problem
  - Lots of local minima mean gradient descent may not find the global optimum - may get stuck in a local minimum
- We would like a convex function so if you run gradient descent you converge to a global minimum
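A small numeric check can make the convexity point tangible. The sketch below is an illustration under simplifying assumptions (one made-up training example with x = 1, y = 0 and a single parameter θ): a convex function must satisfy f((a+b)/2) <= (f(a)+f(b))/2, and the squared-error cost on a sigmoid hypothesis breaks that inequality, while the log-based cost does not. It demonstrates non-convexity only; it does not by itself exhibit multiple local minima.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_cost(theta, x=1.0, y=0.0):
    """Squared-error cost 1/2 (h - y)^2 with a sigmoid hypothesis."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2


def log_cost(theta, x=1.0, y=0.0):
    """Cost -y log(h) - (1 - y) log(1 - h) with a sigmoid hypothesis."""
    h = sigmoid(theta * x)
    return -y * math.log(h) - (1.0 - y) * math.log(1.0 - h)


def midpoint_is_below_chord(cost, a, b):
    """Midpoint convexity check between two parameter values."""
    return cost((a + b) / 2.0) <= (cost(a) + cost(b)) / 2.0


print(midpoint_is_below_chord(squared_cost, 0.0, 10.0))
print(midpoint_is_below_chord(log_cost, 0.0, 10.0))
```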

A convex logistic regression cost function

To get around this we need a different, convex Cost() function which means we can apply gradient descent

This is our logistic regression cost function
- cost(h_θ(x), y) = -log(h_θ(x)) if y = 1
- cost(h_θ(x), y) = -log(1 - h_θ(x)) if y = 0
- This is the penalty the algorithm pays
- Plot the function
Plot y = 1
- So the cost evaluates as -log(h_θ(x))

So when we're right, the cost is 0
- Else the cost increases as we become "more" wrong, slowly at first and then without bound
- X axis is what we predict
- Y axis is the cost associated with that prediction
This cost function has some interesting properties
- If y = 1 and h_θ(x) = 1
  - If the hypothesis predicts exactly 1 and that's exactly correct then the cost is 0 (exactly, not nearly 0)
- As h_θ(x) goes to 0
  - Cost goes to infinity
  - This captures the intuition that if h_θ(x) = 0 (predict P(y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
What about if y = 0
- Then the cost is evaluated as -log(1 - h_θ(x))
- This is the mirror image of the y = 1 curve

Now it goes to plus infinity as h_θ(x) goes to 1
With our particular cost function J(θ) is going to be convex and avoid local minima
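The two-case cost described above can be written directly as a function, which makes it easy to see how the penalty behaves. This is a minimal sketch; the probe values of h are arbitrary.

```python
import math


def cost(h, y):
    """Per-example cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)


# Arbitrary predictions, from confident-and-right to confident-and-wrong for y = 1.
for h in (0.99, 0.5, 0.01):
    print(h, cost(h, 1), cost(h, 0))
```

The cost for y = 1 grows as h moves toward 0, and the y = 0 curve is the same curve reflected about h = 0.5.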

Simplified cost function and gradient descent

Define a simpler way to write the cost function and apply gradient descent to the logistic regression
- By the end should be able to implement a fully functional logistic regression function
Logistic regression cost function is as follows
- J(θ) = (1/m) Σᵢ cost(h_θ(xⁱ), yⁱ)
- cost(h_θ(x), y) = -log(h_θ(x)) if y = 1, and -log(1 - h_θ(x)) if y = 0

The second line is the cost for a single example
- For binary classification problems y is always 0 or 1
  - Because of this, we can have a simpler way to write the cost function
  - Rather than writing the cost function on two lines/two cases
    - Can compress them into one equation - more efficient
- We can write the cost function as
  - cost(h_θ(x), y) = -y log(h_θ(x)) - (1-y) log(1 - h_θ(x))
    - This equation is a more compact form of the two cases above
- We know that there are only two possible cases
  - y = 1
    - Then our equation simplifies to
      - -(1)log(h_θ(x)) - (0)log(1 - h_θ(x))
      - = -log(h_θ(x))
      - Which is what we had before when y = 1
  - y = 0
    - Then our equation simplifies to
      - -(0)log(h_θ(x)) - (1)log(1 - h_θ(x))
      - = -log(1 - h_θ(x))
      - Which is what we had before when y = 0
  - Clever!
So, in summary, our cost function for the θ parameters can be defined as
- J(θ) = -(1/m) Σᵢ [yⁱ log(h_θ(xⁱ)) + (1 - yⁱ) log(1 - h_θ(xⁱ))]

Why do we choose this function when other cost functions exist?
- This cost function can be derived from statistics using the principle of maximum likelihood estimation
  - Note the underlying assumption is that y given x follows a Bernoulli distribution with P(y=1|x; θ) = h_θ(x); no Gaussian assumption about the features is required
- Also has the nice property that it's convex
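The overall cost J(θ), built from the compact single-line form of the per-example cost, can be sketched as follows. The tiny data set is made up for illustration, and each row of X is assumed to already include x₀ = 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, x)))


def cost_J(theta, X, y):
    """Average of -y log(h) - (1 - y) log(1 - h) over the training set."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = hypothesis(theta, x_i)
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1.0 - h)
    return total / len(X)


# Made-up training set: one feature plus the constant x0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost_J([0.0, 0.0], X, y))   # every prediction is 0.5
print(cost_J([-5.0, 2.0], X, y))  # boundary at x1 = 2.5, a better fit
```

With θ = 0 every example costs log 2, so any parameter vector that separates the classes reasonably well should score lower than about 0.693.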
To fit parameters θ:
- Find parameters θ which minimize J(θ)
- This means we have a set of parameters to use in our model for future predictions
Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using
- h_θ(x) = 1/(1 + e^(-θ^T x))
- This result is
  - p(y=1 | x ; θ)
    - Probability y = 1, given x, parameterized by θ

How to minimize the logistic regression cost function

Now we need to figure out how to minimize J(θ)
- Use gradient descent as before
- Repeatedly update each parameter using a learning rate α
  - θ_j := θ_j - α (1/m) Σᵢ (h_θ(xⁱ) - yⁱ) x_jⁱ

If you had n features, you would have an n+1 column vector for θ
This equation is the same as the linear regression rule
- The only difference is that our definition for the hypothesis has changed
Previously, we spoke about how to monitor gradient descent to check it's working
- Can do the same thing here for logistic regression
When implementing logistic regression with gradient descent, we have to update all the θ values (θ₀ to θ_n) simultaneously
- Could use a for loop
- Better would be a vectorized implementation
Feature scaling, as used for gradient descent in linear regression, also applies here
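Putting the pieces together, a batch gradient descent loop for logistic regression might look like the sketch below. It is written with plain lists rather than a vectorized library to stay dependency-free; the data, the learning rate of 0.1 and the iteration count are arbitrary choices for illustration, and the update uses the 1/m-scaled gradient of J(θ).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x0 = 1."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, x)))


def cost_J(theta, X, y):
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = predict_proba(theta, x_i)
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1.0 - h)
    return total / len(X)


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(X)
    # All errors are computed from the old theta before any parameter changes.
    errors = [predict_proba(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
    grads = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
             for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grads)]


def fit(X, y, alpha=0.1, iterations=5000):
    theta = [0.0] * len(X[0])
    history = []  # cost per iteration, useful for monitoring convergence
    for _ in range(iterations):
        history.append(cost_J(theta, X, y))
        theta = gradient_step(theta, X, y, alpha)
    return theta, history


# Made-up, non-separable data: one feature plus the constant x0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
y = [0, 0, 1, 0, 1, 1]

theta, history = fit(X, y)
print(theta, history[0], history[-1])
```

Plotting or printing `history` is the monitoring step mentioned above: with a suitable learning rate the cost should fall on every iteration.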

<source:http://www.holehouse.org/mlclass/06_Logistic_Regression.html, Andrew Ng's Coursera Lectures>

Tuesday, March 22, 2016

Logistic Regression (1)

Classification

Where y is a discrete value
- Develop the logistic regression algorithm to determine what class a new input should fall into
Classification problems
- Email -> spam/not spam?
- Online transactions -> fraudulent?
- Tumor -> Malignant/benign
Variable in these problems is Y
- Y is either 0 or 1
  - 0 = negative class (absence of something)
  - 1 = positive class (presence of something)
Start with binary class problems
- Later look at multiclass classification problem, although this is just an extension of binary classification
How do we develop a classification algorithm?
- Tumour size vs malignancy (0 or 1)
- We could use linear regression
  - Then threshold the classifier output (i.e. anything over some value is yes, else no)
  - In our example below linear regression with thresholding seems to work

We can see above this does a reasonable job of stratifying the data points into one of two classes
- But what if we had a single Yes with a very large tumour
- This would shift the fitted line and lead to classifying some of the existing yeses as nos
Another issue with linear regression
- We know Y is 0 or 1
- Hypothesis can give values larger than 1 or less than 0
So, logistic regression generates a value which is always between 0 and 1
- Logistic regression is a classification algorithm - don't be confused
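The weakness of thresholded linear regression can be reproduced with a few made-up numbers. The sketch below fits an ordinary least-squares line to hypothetical tumour sizes, finds where the line crosses 0.5, and then repeats the fit after adding one extra malignant case with a much larger tumour.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def threshold(xs, ys, cutoff=0.5):
    """Tumour size at which the fitted line crosses the cutoff."""
    slope, intercept = fit_line(xs, ys)
    return (cutoff - intercept) / slope


# Hypothetical tumour sizes and labels (1 = malignant).
sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

before = threshold(sizes, labels)
after = threshold(sizes + [30.0], labels + [1])
print(before, after)
```

Once the threshold moves past 4, the malignant case of size 4 falls on the "no" side, even though the new data point carried no new information about where the classes separate.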

Hypothesis representation

What function is used to represent our hypothesis in classification
We want our classifier to output values between 0 and 1
- When using linear regression we did h_θ(x) = (θ^T x)
- For classification hypothesis representation we do h_θ(x) = g((θ^T x))
  - Where we define g(z)
    - z is a real number
  - g(z) = 1/(1 + e^-z)
    - This is the sigmoid function, or the logistic function
  - If we combine these equations we can write out the hypothesis as
    - h_θ(x) = 1/(1 + e^(-θ^T x))
What does the sigmoid function look like
- Crosses 0.5 at z = 0, then flattens out
- Asymptotes at 0 and 1

Given this we need to fit θ to our data
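The sigmoid itself takes only a few lines. The variant below is one possible sketch that branches on the sign of z so that `math.exp` is never called with a large positive argument; the branching is an implementation choice made here, not something from the lecture notes.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^-z), computed without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # z < 0, so this is at most 1
    return exp_z / (1.0 + exp_z)


for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, sigmoid(z))
```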

Interpreting hypothesis output

When our hypothesis (h_θ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
- Example
  - If X is a feature vector with x₀ = 1 (as always) and x₁ = tumourSize
  - h_θ(x) = 0.7
    - Tells a patient they have a 70% chance of a tumor being malignant
- We can write this using the following notation
  - h_θ(x) = P(y=1|x ; θ)
- What does this mean?
  - Probability that y=1, given x, parameterized by θ
Since this is a binary classification task we know y = 0 or 1
- So the following must be true
  - P(y=1|x ; θ) + P(y=0|x ; θ) = 1
  - P(y=0|x ; θ) = 1 - P(y=1|x ; θ)

Decision boundary

Gives a better sense of what the hypothesis function is computing
Better understanding of what the hypothesis function looks like
- One way of using the sigmoid function is;
  - When the probability of y being 1 is greater than 0.5 then we can predict y = 1
  - Else we predict y = 0
- When is it exactly that h_θ(x) is greater than 0.5?
  - Look at sigmoid function
    - g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
  - So if z is positive, g(z) is greater than 0.5
    - z = (θ^T x)
  - So when
    - θ^T x >= 0
  - Then h_θ(x) >= 0.5
So what we've shown is that the hypothesis predicts y = 1 when θ^T x >= 0
- The corollary of that is that when θ^T x < 0 the hypothesis predicts y = 0
- Let's use this to better understand how the hypothesis makes its predictions

Decision boundary

h_θ(x) = g(θ₀ + θ₁x₁+ θ₂x₂)

So, for example
- θ₀ = -3
- θ₁ = 1
- θ₂ = 1
So our parameter vector is a column vector with the above values
- So, θ^T is a row vector = [-3,1,1]
What does this mean?
- The z here becomes θ^T x
- We predict "y = 1" if
  - -3x₀ + 1x₁ + 1x₂ >= 0
  - -3 + x₁ + x₂ >= 0
We can also re-write this as
- If (x₁ + x₂ >= 3) then we predict y = 1
- If we plot
  - x₁ + x₂ = 3 we graphically plot our decision boundary
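The prediction rule for this example can be checked with a short sketch. The parameter values are the ones given above; the `predict` helper and the test points are made up for illustration.

```python
theta = [-3.0, 1.0, 1.0]  # theta_0, theta_1, theta_2 from the example


def predict(theta, x1, x2):
    """Predict y = 1 when theta^T x >= 0, with x0 = 1."""
    z = theta[0] * 1.0 + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0


print(predict(theta, 1.0, 1.0))  # x1 + x2 = 2, below the line
print(predict(theta, 2.0, 2.0))  # x1 + x2 = 4, above the line
print(predict(theta, 1.5, 1.5))  # x1 + x2 = 3, exactly on the boundary
```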

Means we have these two regions on the graph
- Blue = false
- Magenta = true
- Line = decision boundary
  - Concretely, the straight line is the set of points where h_θ(x) = 0.5 exactly
- The decision boundary is a property of the hypothesis
  - Means we can create the boundary with the hypothesis and parameters without any data
    - Later, we use the data to determine the parameter values
  - As another example, if z = 5 - x₁, then we predict y = 1 if
    - 5 - x₁ > 0
    - 5 > x₁

Non-linear decision boundaries

Get logistic regression to fit a complex non-linear data set
- Like polynomial regression, add higher order terms
- So say we have
  - h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
  - We take the transpose of the θ vector times the input vector
    - Say θ^T was [-1,0,0,1,1] then we say;
    - Predict that "y = 1" if
      - -1 + x₁² + x₂² >= 0
        or
      - x₁² + x₂² >= 1
    - If we plot x₁² + x₂² = 1
      - This gives us a circle with a radius of 1 centered at the origin

This means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
More complex decision boundaries?
- By using higher order polynomial terms, we can get even more complex decision boundaries
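The circular boundary from the example can be sketched by expanding the inputs into polynomial features first. The parameter vector [-1, 0, 0, 1, 1] is the one used above; the helper names and the probe points are illustrative assumptions.

```python
theta = [-1.0, 0.0, 0.0, 1.0, 1.0]


def polynomial_features(x1, x2):
    """Feature vector [1, x1, x2, x1^2, x2^2]."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]


def predict(theta, x1, x2):
    """Predict y = 1 when theta^T f(x) >= 0, i.e. on or outside the unit circle."""
    z = sum(t * f for t, f in zip(theta, polynomial_features(x1, x2)))
    return 1 if z >= 0 else 0


print(predict(theta, 0.0, 0.0))  # centre of the circle
print(predict(theta, 0.5, 0.5))  # inside: 0.25 + 0.25 < 1
print(predict(theta, 1.0, 0.0))  # on the circle
print(predict(theta, 2.0, 0.0))  # outside
```

The hypothesis is still linear in θ; only the features are non-linear in x₁ and x₂, which is why the same fitting procedure applies.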

<source: http://www.holehouse.org, Coursera Machine learning class of Andrew Ng>