Date: 03-08-2019
Author: Achyuthuni Sri Harsha

Introduction

Regression problems are an important category of problems in analytics in which the response variable \(Y\) takes a continuous value. Predicting housing prices in an area is a typical example. In this blog post, I will attempt to solve a supervised regression problem using the famous Boston housing price data set. Besides location and square footage, the value of a house is determined by various other factors.
The data used in this blog is taken from Kaggle. It originates from the UCI Machine Learning Repository. The Boston housing data was collected in 1978, and each of the 506 entries represents aggregated data about 14 features for homes from various suburbs of Boston, Massachusetts.
The data frame contains the following columns:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River categorical variable (tract bounds or otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centers.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: \(1000(Bk - 0.63)^2\), where \(Bk\) is the proportion of Black residents by town.
lstat: lower status of the population (percent).

The target variable is
medv: median value of owner-occupied homes in $1000s.

In this blog, I want to use multiple linear regression for the analysis. A sample of the data is given below:

Sample data
crim     zn    indus  chas       nox     rm     age   dis     rad  tax  ptratio  black   lstat  medv
2.92400  0.0   19.58  otherwise  0.6050  6.101  93.0  2.2834  5    403  14.7     240.16  9.81   25.0
0.12816  12.5  6.07   otherwise  0.4090  5.885  33.0  6.4980  4    345  18.9     396.90  8.79   20.9
0.08244  30.0  4.93   otherwise  0.4280  6.481  18.5  6.1899  6    300  16.6     379.41  6.36   23.7
0.06588  0.0   2.46   otherwise  0.4880  7.765  83.3  2.7410  3    193  17.8     395.56  7.56   39.8
0.02009  95.0  2.68   otherwise  0.4161  8.034  31.9  5.1180  4    224  14.7     390.55  2.88   50.0
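
For reference, the multiple linear regression model mentioned above has the standard form below. The symbols \(\beta_j\), \(\varepsilon\) and \(p\) are notation I am introducing here and do not appear elsewhere in the post: \(Y\) is medv, \(X_1, \dots, X_p\) are the predictors, \(\beta_0\) is the intercept, and \(\varepsilon\) is the error term.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
```

With the 13 predictor columns listed above, \(p = 13\) before any dummy or interaction variables are added.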

The summary statistics for the data are:

##       crim                zn             indus           chas          
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Length:506        
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   Class :character  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Mode  :character  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14                     
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10                     
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74                     
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Data Cleaning and EDA

Zero and near-zero variance features carry little or no information, so they explain almost none of the variance in the response variable.
Zero and near zero variance
feature  freqRatio  percentUnique  zeroVar  nzv
crim     1.000000   99.6047431     FALSE    FALSE
zn       17.714286  5.1383399      FALSE    FALSE
indus    4.400000   15.0197628     FALSE    FALSE
chas     13.457143  0.3952569      FALSE    FALSE
nox      1.277778   16.0079051     FALSE    FALSE
rm       1.000000   88.1422925     FALSE    FALSE
age      10.750000  70.3557312     FALSE    FALSE
dis      1.250000   81.4229249     FALSE    FALSE
rad      1.147826   1.7786561      FALSE    FALSE
tax      3.300000   13.0434783     FALSE    FALSE
ptratio  4.117647   9.0909091      FALSE    FALSE
black    40.333333  70.5533597     FALSE    FALSE
lstat    1.000000   89.9209486     FALSE    FALSE
medv     2.000000   45.2569170     FALSE    FALSE

There are no zero or near-zero variance columns.
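
As a rough illustration of how the two metrics in this table are defined, here is a minimal Python sketch (standard library only). The table looks like the output of caret's nearZeroVar, and the sketch assumes that function's definitions as I understand them: freqRatio is the count of the most common value divided by the count of the second most common, percentUnique is the percentage of distinct values, and a column is flagged when freqRatio exceeds 95/5 and percentUnique is at most 10. The function name nzv_metrics and the chas counts (471 and 35) are my assumptions; the counts were chosen because they reproduce the chas row.

```python
from collections import Counter


def nzv_metrics(values, freq_cut=95 / 5, unique_cut=10):
    """Compute near-zero-variance metrics for one column (assumed caret-style rules)."""
    counts = Counter(values).most_common()
    zero_var = len(counts) == 1
    # Ratio of the most frequent value's count to the second most frequent one.
    freq_ratio = 0.0 if zero_var else counts[0][1] / counts[1][1]
    # Distinct values as a percentage of the number of rows.
    percent_unique = 100 * len(counts) / len(values)
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {
        "freqRatio": freq_ratio,
        "percentUnique": percent_unique,
        "zeroVar": zero_var,
        "nzv": nzv,
    }


# Assumed level counts for chas (471 + 35 = 506 rows).
chas = ["otherwise"] * 471 + ["tract bounds"] * 35
metrics = nzv_metrics(chas)
print(metrics)
```

Under these rules a column needs both a high freqRatio and a low percentUnique to be flagged. That is why black is not flagged even though its freqRatio of 40.33 is above the cutoff: its percentUnique of 70.55 is far above 10.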

Similarly, I can check for linearly dependent columns among the continuous variables.

## $linearCombos
## list()
## 
## $remove
## NULL

There are no linearly dependent columns.
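
The idea behind this check can be sketched in a few lines of Python. This is a minimal illustration using Gram-Schmidt orthogonalisation, not the routine that produced the output above (that output resembles caret's findLinearCombos, which I believe relies on a QR decomposition). The function name dependent_columns, the tolerance, and the toy columns are all made up for the example.

```python
import math


def dependent_columns(columns, tol=1e-8):
    """Return indices of columns that are linear combinations of earlier columns."""
    basis = []      # orthonormal vectors spanning the columns kept so far
    dependent = []
    for idx, col in enumerate(columns):
        residual = list(col)
        # Remove the part of the column already explained by the kept columns.
        for q in basis:
            proj = sum(r * b for r, b in zip(residual, q))
            residual = [r - proj * b for r, b in zip(residual, q)]
        norm = math.sqrt(sum(r * r for r in residual))
        scale = math.sqrt(sum(c * c for c in col))
        if scale == 0 or norm <= tol * scale:
            dependent.append(idx)   # nothing new is left: the column is redundant
        else:
            basis.append([r / norm for r in residual])
    return dependent


# Toy columns (not from the Boston data): c is exactly a + 2 * b.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 0.0, 5.0]
c = [x + 2 * y for x, y in zip(a, b)]
print(dependent_columns([a, b, c]))
```

An empty result, as in the output above, means every column adds information that the others do not already contain.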

Univariate analysis

Now, I want to do some basic EDA on each column. On each continuous column, I want to visually check the following:
1. Variation in the column
2. Its distribution
3. Any outliers
4. q-q plot with normal distribution
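
Two of these checks have simple numerical counterparts, sketched below in Python (standard library only). The sketch assumes the usual box-plot rule for outliers (points more than 1.5 times the IQR beyond the quartiles) and the plotting positions \((i - 0.5)/n\) for the q-q plot; the plots in this post may use different conventions. The function names are hypothetical.

```python
from statistics import NormalDist, quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def qq_points(values):
    """Pair each sorted value with the standard-normal quantile at (i - 0.5) / n."""
    n = len(values)
    normal = NormalDist()
    theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))


# The five medv values from the sample rows shown earlier in the post.
medv_sample = [25.0, 20.9, 23.7, 39.8, 50.0]
print(iqr_outliers(medv_sample))
for theo, obs in qq_points(medv_sample):
    print(f"{theo:+.3f}  {obs}")
```

If the points returned by qq_points fall close to a straight line, the column is roughly normal; systematic curvature at the ends points to skew or heavy tails.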

[Figures omitted: univariate plots (variation, distribution, outliers, and normal q-q plot) for crim, zn, indus, nox, rm, age, dis, rad, tax, ptratio, black, lstat, and medv.]

For categorical variables, I want to look at the frequencies.

Observations:
1. Looking at rad and tax, there seem to be two groups of houses. Houses with rad < 10 roughly follow a normal distribution, and a separate group of houses has rad = 24. As rad is an index, it could also be treated as a categorical variable instead of a continuous one.
2. For data points with rad = 24, the location-based features behave differently. For example, indus, tax and ptratio have a different slope at the same points where rad is 24 (seen in the variation plots, top left).

Therefore, I am tempted to group the houses which have rad = 24 into one category and create interaction variables of that column with rad, indus, ptratio and tax. The new variable is called rad_cat. I would also like to convert rad itself to a categorical variable and see whether it explains the response better than the continuous version.
Additionally, from researching on the internet, I found that the price might have a different slope with respect to the number of rooms (rm) for different classes of people. So, I want to visualize that interaction variable as well.
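
The rad_cat idea can be sketched as follows in Python. Only the name rad_cat comes from this post; the function name, the rad_cat_x_* column names, and the second example row are assumptions made for illustration.

```python
def add_rad_features(row):
    """Add a rad == 24 indicator and its interactions with selected columns."""
    out = dict(row)
    out["rad_cat"] = 1 if row["rad"] == 24 else 0
    for name in ("rad", "indus", "ptratio", "tax"):
        # The interaction is zero unless the row falls in the rad = 24 group.
        out[f"rad_cat_x_{name}"] = out["rad_cat"] * row[name]
    return out


rows = [
    # First row of the sample data (only the columns needed here).
    {"rad": 5, "indus": 19.58, "ptratio": 14.7, "tax": 403},
    # Illustrative rad = 24 row; the values are made up for this example.
    {"rad": 24, "indus": 18.10, "ptratio": 20.2, "tax": 666},
]
engineered = [add_rad_features(r) for r in rows]
for r in engineered:
    print(r)
```

One thing the sketch makes visible: the interaction of rad_cat with rad is always 24 times rad_cat, so in a model that already contains rad_cat it adds no new information.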

Bivariate analysis

I want to understand the relationship of each continuous variable with the \(y\) variable. I will achieve that by doing the following:
1. Draw a scatter plot to look at the relationship between the \(x\) and the \(y\) variables.
2. Draw a linear regression line and a smoothed means line to understand the curve fit.
3. Fit a linear regression using that variable alone and observe the R-squared it achieves.
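
The single-variable fit in the last step can be sketched with the closed-form least-squares formulas. This is a minimal Python illustration (standard library only), not the code that produced the summaries that follow; the function name simple_ols is hypothetical, and the five rows from the sample data are far too few to reproduce the fitted models.

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, returning R-squared as well."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# rm and medv from the five sample rows shown earlier; illustrative only.
rm = [6.101, 5.885, 6.481, 7.765, 8.034]
medv = [25.0, 20.9, 23.7, 39.8, 50.0]
slope, intercept, r2 = simple_ols(rm, medv)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

In each summary below, the Estimate column corresponds to the intercept and slope, and Multiple R-squared corresponds to the third value returned here.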

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.816  -5.455  -1.970   2.633  29.615 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.99736    0.45955  52.220  < 2e-16 ***
## crim        -0.39123    0.04855  -8.059 8.75e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.595 on 405 degrees of freedom
## Multiple R-squared:  0.1382, Adjusted R-squared:  0.1361 
## F-statistic: 64.94 on 1 and 405 DF,  p-value: 8.748e-15
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9353  -5.6853  -0.9847   2.4653  29.0647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.93532    0.48739  42.954  < 2e-16 ***
## zn           0.14247    0.01818   7.835 4.15e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.812 on 405 degrees of freedom
## Multiple R-squared:  0.1316, Adjusted R-squared:  0.1295 
## F-statistic: 61.39 on 1 and 405 DF,  p-value: 4.155e-14
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.922  -5.144  -1.631   2.972  33.069 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.04045    0.77385   38.82   <2e-16 ***
## indus       -0.66951    0.05936  -11.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.249 on 405 degrees of freedom
## Multiple R-squared:  0.239,  Adjusted R-squared:  0.2372 
## F-statistic: 127.2 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.688  -5.146  -2.299   2.794  30.643 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.020      2.081  20.195   <2e-16 ***
## nox          -35.028      3.680  -9.518   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.548 on 405 degrees of freedom
## Multiple R-squared:  0.1828, Adjusted R-squared:  0.1808 
## F-statistic:  90.6 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.886  -2.551   0.174   3.009  39.729 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.6702     2.8680  -12.79   <2e-16 ***
## rm            9.4450     0.4538   20.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.574 on 405 degrees of freedom
## Multiple R-squared:  0.5168, Adjusted R-squared:  0.5156 
## F-statistic: 433.1 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.138  -5.266  -2.033   2.333  31.332 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.1468     1.1373  27.386  < 2e-16 ***
## age          -0.1248     0.0154  -8.104 6.33e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.772 on 405 degrees of freedom
## Multiple R-squared:  0.1395, Adjusted R-squared:  0.1374 
## F-statistic: 65.68 on 1 and 405 DF,  p-value: 6.333e-15
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.010  -5.867  -1.968   2.297  30.275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.4243     0.9436  19.526  < 2e-16 ***
## dis           1.1127     0.2188   5.085 5.62e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.168 on 405 degrees of freedom
## Multiple R-squared:  0.06002,    Adjusted R-squared:  0.0577 
## F-statistic: 25.86 on 1 and 405 DF,  p-value: 5.619e-07
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.022  -5.310  -2.298   3.375  33.475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7220     0.6412  41.677   <2e-16 ***
## rad          -0.4249     0.0493  -8.619   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.693 on 405 degrees of freedom
## Multiple R-squared:  0.155,  Adjusted R-squared:  0.1529 
## F-statistic: 74.28 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.039  -5.235  -2.191   3.166  34.209 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.707058   1.080289   31.20   <2e-16 ***
## tax         -0.026900   0.002427  -11.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.283 on 405 degrees of freedom
## Multiple R-squared:  0.2328, Adjusted R-squared:  0.2309 
## F-statistic: 122.9 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4897  -4.9434  -0.7651   3.4363  31.2566 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.8226     3.4209   18.95   <2e-16 ***
## ptratio      -2.2811     0.1837  -12.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.047 on 405 degrees of freedom
## Multiple R-squared:  0.2758, Adjusted R-squared:  0.274 
## F-statistic: 154.2 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.573  -5.028  -1.864   2.874  27.066 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.505391   1.785873   5.882 8.47e-09 ***
## black        0.033945   0.004844   7.008 1.02e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.93 on 405 degrees of freedom
## Multiple R-squared:  0.1081, Adjusted R-squared:  0.1059 
## F-statistic: 49.11 on 1 and 405 DF,  p-value: 1.017e-11
## 
## [1] "----------------------------------------------------------------------------------------------------"

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.023  -4.173  -1.390   2.172  24.327 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.88480    0.63310   55.10   <2e-16 ***
## lstat       -0.96665    0.04336  -22.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.336 on 405 degrees of freedom
## Multiple R-squared:  0.551,  Adjusted R-squared:  0.5499 
## F-statistic:   497 on 1 and 405 DF,  p-value: < 2.2e-16
## 
## [1] "----------------------------------------------------------------------------------------------------"
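
To compare the twelve single-predictor fits at a glance, the Multiple R-squared values printed above can be collected and ranked. The Python sketch below only copies numbers from the summaries. It also cross-checks one of them using the identity \(R^2 = F / (F + \text{df})\), which holds for a model with a single predictor. Note that each fit reports 405 residual degrees of freedom, which with two coefficients implies 407 rows, fewer than the 506 in the full data; presumably a training split was used, though that is my assumption.

```python
# Multiple R-squared copied from each single-predictor summary above.
r_squared = {
    "crim": 0.1382, "zn": 0.1316, "indus": 0.239, "nox": 0.1828,
    "rm": 0.5168, "age": 0.1395, "dis": 0.06002, "rad": 0.155,
    "tax": 0.2328, "ptratio": 0.2758, "black": 0.1081, "lstat": 0.551,
}

# Rank predictors from most to least variance explained on their own.
ranked = sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True)
for name, r2 in ranked:
    print(f"{name:8s} {r2:.4f}")

# Cross-check for the crim model: R^2 = F / (F + residual df).
f_stat, resid_df = 64.94, 405
r2_from_f = f_stat / (f_stat + resid_df)
print(f"crim R^2 from F-statistic: {r2_from_f:.4f}")
```

On their own, lstat and rm explain the most variance in medv, and dis explains the least.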