x <- 1:100
B <- 100
p_boot <- numeric(B)
for (b in 1:B) {
## Take a bootstrap sample and record its mean.
p_boot[b] <- mean(sample(x, replace = TRUE))
}
p_boot
mean(p_boot)
mean(x)
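From these bootstrap means we can also estimate the standard error of the sample mean and a percentile confidence interval. A minimal sketch using the p_boot vector above (the 95% level is an assumed choice):
sd(p_boot)                         # bootstrap estimate of the standard error of the mean
quantile(p_boot, c(0.025, 0.975))  # 95% percentile bootstrap confidence interval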
### Generate data
x <- 3 * runif(100, 1, 5)
y <- sin(x)
### Replace 15 randomly chosen responses with noise
noise_sample <- sample(1:length(y), 15)
y[noise_sample] <- 2 * (0.5 - runif(15, -1, 0.3))
df <- data.frame(x, y)
plot(y ~ x, col = "blue")
library(tree)
library(randomForest)
## Fit an unpruned regression tree and a random forest to the noisy sine data
tree.df <- tree(y ~ ., df)
rf.df <- randomForest(y ~ ., df)
yhat.rf <- predict(rf.df, newdata = df)
yhat.tree <- predict(tree.df, newdata = df)
plot(tree.df)
text(tree.df, pretty = 0)
tree.df
plot(y ~ x, col = "blue", pch = 18)
points(yhat.tree ~ x, col = "red", pch = 19)
#points(yhat.rf ~ x, col = "green", pch = 20)
#legend("bottomleft", legend = c("y", "Tree-y", "RF-y"),
#       col = c("blue", "red", "green"), pch = 18:20)
prune.df <- prune.tree(tree.df, best = 3)
yhat.tree.p <- predict(prune.df, newdata = df)
plot(prune.df)
text(prune.df, pretty = 0)
prune.df
plot(y ~ x, col = "blue", pch = 18)
points(yhat.tree ~ x, col = "red", pch = 19)
points(yhat.tree.p ~ x, col = "green", pch = 20)
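To quantify how closely each fit tracks the data, we can compare in-sample mean squared errors. A quick sketch using the fitted values above; note these are training errors, so the flexible random forest will look flattering:
mean((y - yhat.tree)^2)    # unpruned tree
mean((y - yhat.tree.p)^2)  # pruned tree (3 leaves)
mean((y - yhat.rf)^2)      # random forest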
In this section we are going to use tree-based methods to predict the median house price, working with the Boston housing dataset. This data frame contains the following columns:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq. ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $\$10,000$.
ptratio - pupil-teacher ratio by town.
black - $1000(Bk-0.63)^2$ where $Bk$ is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $\$1000s$.
The tree library is used to construct classification and regression trees.
#install.packages("tree")
library(tree)
Here we fit a regression tree to the Boston
dataset.
library(MASS)
head(Boston)
Check for missing values:
sum(is.na(Boston))
dim(Boston)
summary(Boston)
Boston$chas
attach(Boston)
Note that the variable chas is treated as numeric even though it is binary.
chas<-factor(chas)
levels(chas)
hist(medv)
qqnorm(medv,pch=1,frame=FALSE)
qqline(medv,col="blue",lwd=2)
require(gpairs)
gpairs(Boston)
Let's see how the median price medv is distributed based on the variable chas.
plot(zn~medv)
boxplot(medv~chas)
crim_dummy <- ifelse(crim < mean(crim), 1, 0)
table(crim_dummy)
boxplot(medv~crim_dummy)
The majority of the zn
values are zero. That is, many suburbs have no residential land zoned for lots over 25,000 sq ft.
zn
zn.zero<-ifelse(zn==0,1,0)
table(zn.zero)
boxplot(medv~zn.zero)
boxplot(indus[zn.zero==1],indus[zn.zero==0],names=c("Zero Res. Land Zoned", "Some Res. Land Zoned"),main="Proportion of Non-Retail Business Acres")
The suburbs with no residential land zoned for lots over 25,000 sq ft have a much higher proportion of non-retail business acres. Large lots usually correspond to business parks, strip shopping centers, shopping malls, etc. It would make sense that areas with no zoned lots of this size would have a higher proportion of non-retail businesses. The variance of this group is also much higher, likely due to outliers.
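We can back up this visual impression with numeric summaries; a small sketch using the zn.zero indicator defined above:
tapply(indus, zn.zero, mean)  # mean proportion of non-retail acres in each group
tapply(indus, zn.zero, var)   # the group variances differ markedly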
The rm variable is real-valued (because it is an average). We create a discrete variable rm2 that rounds the number of rooms to the nearest integer. Let's check which values the rounded number of rooms takes and how often each of them occurs.
summary(rm)
rm2<-round(rm,0)
summary(rm2)
sort(unique(rm2))
table(rm2)
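As an illustrative sketch (treating rm2 simply as a grouping variable), a boxplot shows how the median home value rises with the rounded room count:
boxplot(medv ~ rm2, xlab = "Rounded number of rooms", ylab = "Median home value ($1000s)")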
Now we use graphs to illustrate the differences (for at least three variables) between towns with a pupil-teacher ratio of less than 20 and towns with a pupil-teacher ratio of greater than or equal to 20.
summary(ptratio)
pt.20<-ifelse(ptratio>=20,1,0)
table(pt.20)
m<-matrix(c(1,2,3,4,5,6),ncol=2)
layout(m)
#Crime Rate: very different distributions; best looked at with histograms
hist(crim[pt.20==1],xlab="Per Capita Crime Rate",main="Crime: PT Ratio >= 20")
hist(crim[pt.20==0],xlab="Per Capita Crime Rate",main="Crime: PT Ratio < 20")
# Proportion of Residential Land zoned:
#the huge frequency of zn = 0 makes the distribution hard to graph. Probably more useful to look at the
#table of the two indicator variables (zn = 0 yes/no vs. pt.ratio >= 20 yes/no)
barplot(table(zn.zero,pt.20),beside=T, names=c("Pt.Ratio >=20", "Pt.Ratio<20"),legend.text=c("None Zoned", "Some Zoned"),
main="Residential Zoning vs. PT Ratio")
#Proportion of Non-retail business acres per town: very different distributions; best looked at with histograms
hist(indus[pt.20==1],xlab="Proportion of Non-Retail Business Acres",main="Indus: Pt Ratio >=20")
hist(indus[pt.20==0],xlab="Proportion of Non-Retail Business Acres",main="Indus: Pt Ratio < 20")
#Charles River: dummy variable; again we look at the table of the two indicator variables
barplot(table(chas,pt.20),beside=T, names=c("Pt.Ratio >=20", "Pt.Ratio<20"),legend.text=c("Bounds River", "Doesn’t Bound River"),
main="Charles River vs. PT Ratio")
The per capita crime rate appears to be much higher in towns with a greater pupil-teacher ratio. The proportion of non-retail business acres is not distributed very differently once the actual frequencies are taken into account, apart from a very large peak around 15-20% in the suburbs with a large pupil-teacher ratio. Differences in the amount of residential land zoned are apparent in suburbs with a low pupil-teacher ratio. The differences between suburbs on the river and off the river do not appear to depend greatly on the pupil-teacher ratio.
par(mfrow=c(3,2))
#Nitrogen Oxides Concentration (parts per 10 million)
boxplot(nox[pt.20==1],nox[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="NOX vs. PT Ratio")
#Average number of rooms per dwelling
boxplot(rm[pt.20==1],rm[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="Avg # of Rooms vs. PT Ratio")
#Proportion of owner-occupied units built prior to 1940
boxplot(age[pt.20==1],age[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="# of Units built pre-1940 vs. PT Ratio")
#weighted mean of distances to five Boston employment centers
boxplot(dis[pt.20==1],dis[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="Dist to Employment Centers vs. PT Ratio")
# Index of Accessibility to radial highways: very different distributions but integer-valued
#better to use barplots but we want the ranges to be the same
rad1.freq <- rad2.freq <- rep(0, max(rad, na.rm = TRUE) - min(rad, na.rm = TRUE) + 1)
for (i in 1:nrow(Boston)) {
  if (!is.na(pt.20[i])) {
    if (pt.20[i] == 1) rad1.freq[rad[i]] <- rad1.freq[rad[i]] + 1
    if (pt.20[i] == 0) rad2.freq[rad[i]] <- rad2.freq[rad[i]] + 1
  }
}
barplot(rad1.freq,xlab="Index of Accessibility to Radial Highways",names=seq(1,24),col=3,main="Rad: PT Ratio >=20")
barplot(rad2.freq,xlab="Index of Accessibility to Radial Highways",names=seq(1,24),col=3,main="Rad: PT Ratio < 20")
In suburbs with a higher pupil-teacher ratio, the nitrogen oxides concentration appears to be higher. The average number of rooms does not seem to depend on the pupil-teacher ratio, though suburbs with a low pupil-teacher ratio perhaps have slightly larger houses. Suburbs with high pupil-teacher ratios have older houses (but note the large number of outliers) and are closer to employment centers. With regard to accessibility to radial highways, there appears to be a group of high pupil-teacher-ratio suburbs with better access. Otherwise, the shapes of the distributions are not that different (taking the frequencies into account).
The goal is to predict the median value of owner-occupied homes medv
based on the given predictors.
We start by creating a training set and fitting the tree to the training data.
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
Notice that the output of summary()
indicates that only three of the variables
have been used in constructing the tree. In the context of a regression
tree, the deviance is simply the sum of squared errors for the tree. We now
plot the tree.
plot(tree.boston)
text(tree.boston, pretty = 0)
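As a quick sanity check of the claim that the deviance is the sum of squared errors, we can compute the training RSS directly; a sketch using the objects fit above:
yhat.train <- predict(tree.boston, newdata = Boston[train, ])
sum((yhat.train - Boston$medv[train])^2)  # equals the tree's total deviance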
The variable lstat
measures the percentage of individuals with lower
socioeconomic status. The tree indicates that lower values of lstat
correspond
to more expensive houses. The tree predicts a median house price
of $\$46,380$ for larger homes in suburbs in which residents have high socioeconomic
status ($rm\geq 7.5$ and $lstat<9.715$).
Now we use the cv.tree()
function to see whether pruning the tree will
improve performance.
cv.boston <- cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = "b")
In this case, the most complex tree is selected by cross-validation. However,
if we wish to prune the tree, we could do so as follows, using the
prune.tree()
function:
prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)
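Rather than hard-coding best = 5, we could let the cross-validation results choose the size; a small sketch using cv.boston from above (prune.cv is a hypothetical name):
best.size <- cv.boston$size[which.min(cv.boston$dev)]  # size with the lowest CV deviance
prune.cv <- prune.tree(tree.boston, best = best.size)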
In keeping with the cross-validation results, we use the unpruned tree to make predictions on the test set.
yhat <- predict(tree.boston, newdata = Boston[-train, ])
boston.test <- Boston[-train, "medv"]
plot(yhat, boston.test)
abline(0, 1)
ssr.tree <- (yhat - boston.test)^2
round(mean((yhat - boston.test)^2), 3)
In other words, the test set MSE associated with the regression tree is $25.05$. The square root of the MSE is therefore around $5.005$, indicating that this model leads to test predictions that are within around $\$5,005$ of the true median home value for the suburb.
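The root mean squared error can be computed directly from the squared errors saved above:
sqrt(mean(ssr.tree))  # test-set RMSE of the single tree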
Here we apply bagging and random forests to the Boston
data, using the
randomForest
package in R
.
Recall that bagging is simply a special case of
a random forest with $m = p$. Therefore, the randomForest()
function can
be used to perform both random forests and bagging. We perform bagging as follows:
#install.packages("randomForest")
library (randomForest)
set.seed (1)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
bag.boston
The argument $mtry=13$ indicates that all $13$ predictors should be considered for each split of the tree—in other words, that bagging should be done. How well does this bagged model perform on the test set?
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
ssr.bag <- (yhat.bag - boston.test)^2
mean((yhat.bag - boston.test)^2)
The test set MSE associated with the bagged regression tree is $13.16$, almost
half that obtained using an optimally-pruned single tree. We could change
the number of trees grown by randomForest()
using the ntree
argument:
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, ntree = 25)
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)
Growing a random forest proceeds in exactly the same way, except that
we use a smaller value of the mtry
argument. By default, randomForest()
uses $p/3$ variables when building a random forest of regression trees. Here we
use $mtry = 6$.
set.seed(1)
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
ssr.rf <- (yhat.rf - boston.test)^2
mean((yhat.rf - boston.test)^2)
sqrt(mean(ssr.rf))  # RMSE of the random forest on the test set
The test set MSE is $11.31$; this indicates that random forests yielded an improvement over bagging in this case.
Using the importance()
function, we can view the importance of each
variable.
importance(rf.boston)
Two measures of variable importance are reported. The former is based upon the mean decrease in accuracy of predictions on the out-of-bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. In the case of regression trees, the node impurity is measured by the training RSS. Plots of these importance measures can be produced using the varImpPlot() function.
varImpPlot(rf.boston)
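To rank the predictors numerically we can sort the importance matrix; a sketch assuming the regression defaults, where the accuracy-based column is named "%IncMSE":
imp <- importance(rf.boston)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]  # most important variables first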
The results indicate that across all of the trees considered in the random
forest, the wealth level of the community (lstat
) and the house size (rm
)
are by far the two most important variables.
Now let's fit a linear model and compare the tree-based models with it.
# Fit the linear model on the training observations only, so that the
# test-set comparison is fair
lm.boston <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax +
                  ptratio + black + lstat, data = Boston, subset = train)
yhat.lm <- predict(lm.boston, newdata = Boston[-train, ])
ssr.lm <- (yhat.lm - boston.test)^2
mean(ssr.lm)
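Before plotting, it is convenient to line up the four test MSEs, using the squared-error vectors computed above:
c(Tree = mean(ssr.tree), Bag = mean(ssr.bag), RF = mean(ssr.rf), LM = mean(ssr.lm))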
# Compare the distributions of squared prediction errors on the test set
boxplot(ssr.tree, ssr.bag, ssr.rf, ssr.lm,
        names = c("Tree", "Bag", "RF", "LM"),
        ylim = c(0, 100), col = c("blue", "green", "green", "purple"))