Posts in the 'Statistics Theory/Data Analysis Practice' category - A Person Trying to Survive

2 posts in 'Statistics Theory/Data Analysis Practice'

[Practice] Analysis of Economic Participation Probability among Overseas Economic Participants 2019.12.10
[Practice] Estimation of Return to Schooling 2019.11.19

2019. 12. 10. 17:14 - MunJunHyeok

Analysis of Labor Force Participation Rates of Men and Women (Overseas)

1. Loading and cleaning the data

par <- read.csv('women_work.csv')
str(par)

## 'data.frame':    2000 obs. of  13 variables:
##  $ c1           : num  -0.436 0.352 1.077 1.021 -0.443 ...
##  $ c2           : num  -0.0969 0.3005 -1.596 -1.7105 0.3083 ...
##  $ u            : num  -0.218 0.176 0.539 0.511 -0.221 ...
##  $ v            : num  -0.3757 0.4612 -0.3762 -0.497 -0.0925 ...
##  $ county       : int  1 2 3 4 5 6 7 8 9 0 ...
##  $ age          : int  22 36 28 37 39 33 57 45 39 25 ...
##  $ education    : int  10 10 10 10 10 10 10 16 12 10 ...
##  $ married      : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ children     : int  0 0 0 0 1 2 1 0 0 3 ...
##  $ select       : num  16.8 32.4 19.2 21.3 32 ...
##  $ wagefull     : num  12.8 20.3 23.1 24.5 16.1 ...
##  $ wage         : num  NA 20.3 NA NA 16.1 ...
##  $ participation: int  0 1 0 0 1 1 1 1 0 1 ...

par$married <- factor(par$married,
                      labels = c('single',
                                 'married'))
table(par$married)

## 
##  single married 
##     659    1341
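When `factor()` is called with `labels` only, as above, R pairs the labels with the sorted unique values of the column. A slightly more defensive variant is sketched below on a small made-up vector (`married_raw` is a hypothetical stand-in for `par$married`); it states the `levels` explicitly so the 0/1 coding is visible in the code.

```r
# Hypothetical stand-in for par$married (assumed coding: 0 = single, 1 = married)
married_raw <- c(1, 1, 0, 1, 0, 1)

# Explicit levels make the 0 -> 'single', 1 -> 'married' mapping unambiguous
married <- factor(married_raw,
                  levels = c(0, 1),
                  labels = c('single', 'married'))

counts <- table(married)
print(counts)
```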

* Computing the participation rate (%)

## # A tibble: 2 x 3
##   participation     n percent
##           <int> <int>   <dbl>
## 1             0   657    32.8
## 2             1  1343    67.2
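The code that produced the table above is not shown in the post. One way to get a table of this shape, assuming `dplyr` was used (the tibble output suggests so), is sketched here on a small made-up data frame; `toy` is a hypothetical stand-in for `par`.

```r
library(dplyr)

# Hypothetical toy data standing in for `par`; only `participation` is needed
toy <- data.frame(participation = c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1))

rate <- toy %>%
  count(participation) %>%              # number of rows per participation status
  mutate(percent = n / sum(n) * 100)    # share of each status, in percent

print(rate)
```

Replacing `toy` with `par` should give the counts and percentages for the real data.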

2. Linear probability model (linear regression). When age, education, married, or children increases by one unit, by how much does the 'probability' of labor force participation change?

## 
## Call:
## lm(formula = participation ~ age + education + married + children, 
##     data = par)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0703 -0.4142  0.1372  0.3437  0.8060 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.207323   0.054111  -3.831 0.000131 ***
## age             0.010255   0.001227   8.358  < 2e-16 ***
## education       0.018601   0.003250   5.724 1.20e-08 ***
## marriedmarried  0.111112   0.021948   5.063 4.52e-07 ***
## children        0.115308   0.006772  17.028  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4199 on 1995 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.201 
## F-statistic: 126.7 on 4 and 1995 DF,  p-value: < 2.2e-16
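Only the output of the regression is shown above. The sketch below fits the same formula to simulated data, since `women_work.csv` is not available here; the data frame `sim` and the rule that generates it are assumptions made purely for illustration, so the estimates will not match the post.

```r
set.seed(1)

# Simulated stand-in for `par` (all values are made up for illustration)
n   <- 500
sim <- data.frame(
  age       = sample(20:59, n, replace = TRUE),
  education = sample(10:20, n, replace = TRUE),
  married   = factor(sample(c('single', 'married'), n, replace = TRUE),
                     levels = c('single', 'married')),
  children  = sample(0:5, n, replace = TRUE)
)

# Participation generated from an arbitrary logistic rule
index <- -4 + 0.06 * sim$age + 0.1 * sim$education +
  0.7 * (sim$married == 'married') + 0.7 * sim$children
sim$participation <- rbinom(n, size = 1, prob = plogis(index))

# Linear probability model: ordinary least squares on a 0/1 outcome
lpm <- lm(participation ~ age + education + married + children, data = sim)

# Each coefficient is the change in P(participation = 1) per one-unit change,
# so multiplying by 100 expresses it in percentage points
pp <- round(coef(lpm) * 100, 2)
print(pp)
```

Read this way, the estimate of 0.018601 for education reported above corresponds to about 1.9 percentage points per additional year of education.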

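* Prediction model and plot based on the fitted linear regression (difference by marital status)

-> The fitted line was meant to be used for prediction, but it extends outside the interval from 0 to 1, so its values cannot be read as probabilities.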
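The out-of-range problem can be checked by hand from the coefficients reported in the `lm()` summary. The two profiles below are hypothetical, chosen for illustration rather than taken from the data.

```r
# Coefficients copied from the lm() summary above
b <- c(intercept = -0.207323, age = 0.010255, education = 0.018601,
       married = 0.111112, children = 0.115308)

# Two hypothetical profiles (not taken from the data)
profiles <- data.frame(
  age       = c(25, 60),
  education = c(10, 20),
  married   = c(0, 1),   # 0 = single, 1 = married
  children  = c(0, 5)
)

# Linear prediction: intercept plus the sum of coefficient * value
p_hat <- b[['intercept']] +
  b[['age']]       * profiles$age +
  b[['education']] * profiles$education +
  b[['married']]   * profiles$married +
  b[['children']]  * profiles$children

print(p_hat)
print(p_hat > 1 | p_hat < 0)   # TRUE marks a value that is not a valid probability
```

The second profile yields a predicted "probability" above 1, which is the limitation of the linear probability model noted above.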

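3. Generalized linear model (applying the logistic function)
* When age, education, married, or children increases by one unit, by how much does the 'single index' (the linear predictor) change? The coefficients alone tell us only the direction, that is, whether the probability rises or falls.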

## 
## Call:
## glm(formula = participation ~ age + education + married + children, 
##     family = "binomial", data = par)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6212  -0.9292   0.4614   0.8340   2.0455  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -4.159247   0.332040 -12.526  < 2e-16 ***
## age             0.057930   0.007221   8.022 1.04e-15 ***
## education       0.098251   0.018652   5.268 1.38e-07 ***
## marriedmarried  0.741777   0.126471   5.865 4.49e-09 ***
## children        0.764488   0.051529  14.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2532.4  on 1999  degrees of freedom
## Residual deviance: 2055.8  on 1995  degrees of freedom
## AIC: 2065.8
## 
## Number of Fisher Scoring iterations: 5

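-> All four coefficients are positive, so a one-unit increase in age, education, married, or children raises the probability of participation. The coefficients are changes in the log-odds, so by themselves they do not say by how many percentage points the probability rises.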
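To turn the single index into a probability, the logistic function can be applied to it. The sketch below uses the coefficients reported in the `glm()` summary and one hypothetical profile; the marginal-effect formula p(1 - p) * beta is the standard derivative of the logistic function with respect to a continuous regressor.

```r
# Coefficients copied from the glm() summary above
b <- c(intercept = -4.159247, age = 0.057930, education = 0.098251,
       married = 0.741777, children = 0.764488)

# Hypothetical profile: 35 years old, 12 years of education, married, 1 child
x <- c(intercept = 1, age = 35, education = 12, married = 1, children = 1)

index <- sum(b * x)      # the single index x'beta (log-odds)
p     <- plogis(index)   # logistic function: 1 / (1 + exp(-index))

# Marginal effect of age at this profile: p * (1 - p) * beta_age
me_age <- p * (1 - p) * b[['age']]

print(c(index = index, probability = p, me_age = me_age))
```

The size of the marginal effect depends on the profile at which it is evaluated, while its sign always follows the sign of the coefficient.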

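* Plot of the glm fit

-> All fitted values lie between 0 and 1, so the model can be used for prediction.
* Economists are generally interested in the slope of this fitted curve (the marginal effect), whereas in machine learning the interest is in prediction with the model.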
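Both viewpoints can be illustrated with one fitted logit. The sketch below uses simulated data (`sim` and its generating rule are assumptions, not the post's data) and a reduced formula with two regressors to keep it short.

```r
set.seed(2)

# Simulated stand-in for `par` (values are made up)
n   <- 500
sim <- data.frame(age      = sample(20:59, n, replace = TRUE),
                  children = sample(0:5, n, replace = TRUE))
sim$participation <- rbinom(n, 1, plogis(-3 + 0.06 * sim$age + 0.7 * sim$children))

logit <- glm(participation ~ age + children, family = binomial, data = sim)
p_hat <- predict(logit, type = 'response')   # fitted probabilities

# Fitted probabilities from a logit lie between 0 and 1
print(range(p_hat))

# Economist's view: average marginal effect of age, mean of p(1 - p) * beta_age
ame_age <- mean(p_hat * (1 - p_hat)) * coef(logit)[['age']]

# Machine-learning view: classify with a 0.5 cutoff and measure in-sample accuracy
accuracy <- mean((p_hat > 0.5) == sim$participation)

print(c(ame_age = ame_age, accuracy = accuracy))
```

The 0.5 cutoff and in-sample accuracy are simplifications; a held-out sample would give a more honest measure of predictive performance.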

[Practice] Estimation of Return to Schooling

2019. 11. 19. 17:22 - MunJunHyeok

Estimating the Return to Schooling

Getting started

Reading data: labor_supply_female.csv
Create a new chunk in R Markdown: Ctrl + Alt + I
Loading libraries

library(tidyverse)
library(readr)
library(gridExtra)
library(stargazer)
library(showtext)
font_add_google('Nanum Gothic','nanumgothic')
showtext::showtext_auto()

Loading and preparing the data

labor.sup <- readr::read_csv('labor_supply.csv')

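# Labels are kept in Korean so that they match the regression output below:
# 무학 = no schooling, 초졸 = elementary school, 중졸 = middle school, 고졸 = high school,
# 전문대졸 = junior college, 4년제 = four-year university, 석사 = master's, 박사 = doctorate
labor.sup$w2edu <- factor(labor.sup$w2edu,
                          labels = c('무학','초졸','중졸','고졸','전문대졸','4년제','석사','박사'))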

Inspecting the data with a plot

labor.sup %>% 
  group_by(w2edu) %>%
  # Set the x-axis to age and the y-axis to the log of hourly wage
  ggplot(mapping = aes(x     = age,
                       y     = ln_wage_hourly)) +
  
  # Draw the scatter plot and the fitted trend line
  geom_point(aes(col = w2edu)) +
  geom_smooth(method = 'glm',
              formula = y ~ poly(x,2),
              color = 'steelblue',
              se = FALSE,
              linetype = 'dashed') +
  scale_color_brewer(palette = 'RdYlBu') +
  xlim(18, 80) +
  xlab('age') +
  ylab('log_hourly_wage')

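Regression

Regression with age as the independent variable
Regression with age and age squared as independent variables
Regression with age, age squared, and education level (w2edu) as independent variables
Regression with age, age squared, and years of education (educ_year) as independent variables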

# no.1
lm.1 <- lm(ln_wage_hourly ~ age,
           data = labor.sup)
summary(lm.1)

## 
## Call:
## lm(formula = ln_wage_hourly ~ age, data = labor.sup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7605 -0.4103 -0.0118  0.4344  2.6761 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.152733   0.076711   1.991   0.0467 *  
## age         -0.017896   0.001847  -9.689   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.774 on 1123 degrees of freedom
##   (694 observations deleted due to missingness)
## Multiple R-squared:  0.07714,    Adjusted R-squared:  0.07632 
## F-statistic: 93.87 on 1 and 1123 DF,  p-value: < 2.2e-16

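# Conclusion: the coefficient on age is negative, i.e., each additional year of age is associated with an hourly wage about 1.8% lower.
# The result runs against expectation because the model is too restrictive: a single straight line cannot capture wages that first rise and then fall with age (underfitting).
# For contrast, overfitting is the opposite problem: the model is fit so closely to the sample that it predicts new data poorly.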

# no.2
lm.2 <- lm(ln_wage_hourly ~ age + I(age^2),
           data = labor.sup)
summary(lm.2)

## 
## Call:
## lm(formula = ln_wage_hourly ~ age + I(age^2), data = labor.sup)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.87855 -0.40389 -0.02607  0.44161  2.97720 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.7420240  0.2244846  -7.760 1.90e-14 ***
## age          0.0787824  0.0109577   7.190 1.18e-12 ***
## I(age^2)    -0.0011215  0.0001254  -8.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7481 on 1122 degrees of freedom
##   (694 observations deleted due to missingness)
## Multiple R-squared:  0.1385, Adjusted R-squared:  0.137 
## F-statistic: 90.22 on 2 and 1122 DF,  p-value: < 2.2e-16

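# Conclusion a: the coefficient on age is now positive, so the fitted wage profile rises with age at first.
# The 0.079 coefficient is the slope at age 0 only; the effect of one more year is 0.0788 - 2 * 0.00112 * age, which is about 3.4% at age 20 and turns negative after roughly age 35.
# Conclusion b: the coefficient on the squared term is negative, so the profile is a concave (inverted-U) quadratic.
# That is, the marginal return to age diminishes as age increases.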
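The age-dependent slope and the peak of the fitted profile follow directly from the two coefficients reported for `lm.2`. A short worked sketch (the helper `marginal_effect` is a name introduced here for illustration):

```r
# Coefficients copied from the lm.2 summary above
b_age  <- 0.0787824
b_age2 <- -0.0011215

# Derivative of log wage with respect to age: b_age + 2 * b_age2 * age
marginal_effect <- function(age) b_age + 2 * b_age2 * age

# Age at which the fitted profile peaks (derivative equals zero)
peak_age <- -b_age / (2 * b_age2)

ages <- c(20, 30, 40, 50, 60)
print(round(100 * marginal_effect(ages), 2))  # approx. % change in wage per extra year
print(round(peak_age, 1))
```

The marginal effect is positive before the peak age and negative after it, which is the inverted-U shape described in conclusion b.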

Add the education variable, w2edu

# no.3
lm.3 <- lm(ln_wage_hourly ~ age + I(age^2) + w2edu,
           data = labor.sup)
summary(lm.3)

## 
## Call:
## lm(formula = ln_wage_hourly ~ age + I(age^2) + w2edu, data = labor.sup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3140 -0.3123  0.0553  0.4167  1.9190 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.8605677  0.2419628 -11.822  < 2e-16 ***
## age            0.0742900  0.0112049   6.630 5.22e-11 ***
## I(age^2)      -0.0008390  0.0001336  -6.277 4.92e-10 ***
## w2edu초졸      0.1421755  0.1541257   0.922 0.356486    
## w2edu중졸      0.3391013  0.1627300   2.084 0.037403 *  
## w2edu고졸      0.6199602  0.1651687   3.753 0.000183 ***
## w2edu전문대졸  0.9276296  0.1752402   5.293 1.44e-07 ***
## w2edu4년제     1.2356706  0.1706671   7.240 8.33e-13 ***
## w2edu석사      1.6530789  0.1913012   8.641  < 2e-16 ***
## w2edu박사      1.8454967  0.3457613   5.337 1.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6758 on 1115 degrees of freedom
##   (694 observations deleted due to missingness)
## Multiple R-squared:  0.3015, Adjusted R-squared:  0.2958 
## F-statistic: 53.46 on 9 and 1115 DF,  p-value: < 2.2e-16

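# To avoid perfect multicollinearity (the dummy variable trap), one category, '무학' (no schooling), is omitted and serves as the reference group.
# Each education coefficient is interpreted relative to that omitted category.
# Conclusion: for example, compared with '무학', a '고졸' (high school) graduate earns an hourly wage about 0.62 log points higher, i.e., roughly exp(0.62) - 1 = 86% higher. Reading the coefficient directly as 62% understates the gap, because the log approximation is loose for large coefficients.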
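Because the dependent variable is in logs, a dummy coefficient b implies an exact percentage difference of 100 * (exp(b) - 1). The sketch below applies this to the education coefficients reported for `lm.3`; the English names are labels introduced here for readability, listed in the same order as the Korean levels in the output.

```r
# Education coefficients copied from the lm.3 summary above, in output order:
# 초졸, 중졸, 고졸, 전문대졸, 4년제, 석사, 박사 (reference category: 무학, no schooling)
b_edu <- c(elementary = 0.1421755, middle = 0.3391013, high_school = 0.6199602,
           junior_college = 0.9276296, university = 1.2356706,
           masters = 1.6530789, doctorate = 1.8454967)

# Exact percentage difference implied by a log-wage coefficient
pct_diff <- 100 * (exp(b_edu) - 1)

print(round(pct_diff, 1))
```

For small coefficients the two readings nearly coincide, but they diverge quickly as the coefficient grows.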

Report the results of models 2, 3, and 4

# no.4
lm.4 <- lm(ln_wage_hourly ~ age + I(age^2) + educ_year,
           data = labor.sup)
summary(lm.4)

## 
## Call:
## lm(formula = ln_wage_hourly ~ age + I(age^2) + educ_year, data = labor.sup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7045 -0.3190  0.0404  0.4383  1.9297 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.1247333  0.2257963 -13.839  < 2e-16 ***
## age          0.0505973  0.0102098   4.956 8.31e-07 ***
## I(age^2)    -0.0005288  0.0001216  -4.348 1.50e-05 ***
## educ_year    0.1165055  0.0078876  14.771  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6848 on 1121 degrees of freedom
##   (694 observations deleted due to missingness)
## Multiple R-squared:  0.2789, Adjusted R-squared:  0.277 
## F-statistic: 144.5 on 3 and 1121 DF,  p-value: < 2.2e-16

stargazer::stargazer(lm.2,lm.3,lm.4,
                     type = 'text')

## 
## ===============================================================================================
##                                                 Dependent variable:                            
##                     ---------------------------------------------------------------------------
##                                                   ln_wage_hourly                               
##                               (1)                      (2)                       (3)           
## -----------------------------------------------------------------------------------------------
## age                         0.079***                 0.074***                 0.051***         
##                             (0.011)                  (0.011)                   (0.010)         
##                                                                                                
## I(age2)                    -0.001***                -0.001***                 -0.001***        
##                             (0.0001)                 (0.0001)                 (0.0001)         
##                                                                                                
## w2edu초졸                                               0.142                                    
##                                                      (0.154)                                   
##                                                                                                
## w2edu중졸                                              0.339**                                   
##                                                      (0.163)                                   
##                                                                                                
## w2edu고졸                                              0.620***                                  
##                                                      (0.165)                                   
##                                                                                                
## w2edu전문대졸                                            0.928***                                  
##                                                      (0.175)                                   
##                                                                                                
## w2edu4년제                                             1.236***                                  
##                                                      (0.171)                                   
##                                                                                                
## w2edu석사                                              1.653***                                  
##                                                      (0.191)                                   
##                                                                                                
## w2edu박사                                              1.845***                                  
##                                                      (0.346)                                   
##                                                                                                
## educ_year                                                                     0.117***         
##                                                                                (0.008)         
##                                                                                                
## Constant                   -1.742***                -2.861***                 -3.125***        
##                             (0.224)                  (0.242)                   (0.226)         
##                                                                                                
## -----------------------------------------------------------------------------------------------
## Observations                 1,125                    1,125                     1,125          
## R2                           0.139                    0.301                     0.279          
## Adjusted R2                  0.137                    0.296                     0.277          
## Residual Std. Error    0.748 (df = 1122)        0.676 (df = 1115)         0.685 (df = 1121)    
## F Statistic         90.220*** (df = 2; 1122) 53.464*** (df = 9; 1115) 144.514*** (df = 3; 1121)
## ===============================================================================================
## Note:                                                               *p<0.1; **p<0.05; ***p<0.01

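# Conclusion: holding age constant, one additional year of education is associated with an hourly wage about 11.7% higher (0.117 log points, or about 12.4% using exp(0.117) - 1).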
