x <- 1:100
B <- 100
p_boot <- numeric(B)
for (b in 1:B) {
## Take a bootstrap sample and record its mean.
p_boot[b] <- mean(sample(x, replace = TRUE))
}
p_boot
mean(p_boot)
mean(x)
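From these bootstrap means we can also estimate the standard error of the sample mean and a percentile confidence interval. A minimal sketch using the p_boot vector above (the 95% level is an assumed choice):
sd(p_boot)                         # bootstrap estimate of the standard error of the mean
quantile(p_boot, c(0.025, 0.975))  # 95% percentile bootstrap confidence interval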
### Generate data
x <- 3 * runif(100, 1, 5)
y <- sin(x)
### Replace 15 randomly chosen responses with noise
noise_sample <- sample(1:length(y), 15)
y[noise_sample] <- 2 * (0.5 - runif(15, -1, 0.3))
df <- data.frame(x, y)
plot(y ~ x, col = "blue")
library(tree)
library(randomForest)
## Fit an unpruned regression tree and a random forest to the noisy sine data
tree.df <- tree(y ~ ., df)
rf.df <- randomForest(y ~ ., df)
yhat.rf <- predict(rf.df, newdata = df)
yhat.tree <- predict(tree.df, newdata = df)
plot(tree.df)
text(tree.df, pretty = 0)
tree.df
plot(y ~ x, col = "blue", pch = 18)
points(yhat.tree ~ x, col = "red", pch = 19)
#points(yhat.rf ~ x, col = "green", pch = 20)
#legend("bottomleft", legend = c("y", "Tree-y", "RF-y"),
#       col = c("blue", "red", "green"), pch = 18:20)
prune.df <- prune.tree(tree.df, best = 3)
yhat.tree.p <- predict(prune.df, newdata = df)
plot(prune.df)
text(prune.df, pretty = 0)
prune.df
plot(y ~ x, col = "blue", pch = 18)
points(yhat.tree ~ x, col = "red", pch = 19)
points(yhat.tree.p ~ x, col = "green", pch = 20)
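To quantify how closely each fit tracks the data, we can compare in-sample mean squared errors. A quick sketch using the fitted values above; note these are training errors, so the flexible random forest will look flattering:
mean((y - yhat.tree)^2)    # unpruned tree
mean((y - yhat.tree.p)^2)  # pruned tree (3 leaves)
mean((y - yhat.rf)^2)      # random forest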
In this section we are going to use tree-based methods to predict the median house price, working with the Boston housing dataset. This data frame contains the following columns:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq. ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $\$10,000$.
ptratio - pupil-teacher ratio by town.
black - $1000(Bk-0.63)^2$ where $Bk$ is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $\$1000s$.
The tree library is used to construct classification and regression trees.
#install.packages("tree")
library(tree)
Here we fit a regression tree to the Boston
dataset.
library(MASS)
head(Boston)
Check for missing values:
sum(is.na(Boston))
dim(Boston)
summary(Boston)
Boston$chas
attach(Boston)
Note that the variable chas is treated as numeric even though it is binary.
chas<-factor(chas)
levels(chas)
hist(medv)
qqnorm(medv,pch=1,frame=FALSE)
qqline(medv,col="blue",lwd=2)
require(gpairs)
gpairs(Boston)
Let's see how the median price medv is distributed based on the variable chas.
plot(zn~medv)
boxplot(medv~chas)
crim_dummy <- ifelse(crim < mean(crim), 1, 0)
table(crim_dummy)
boxplot(medv~crim_dummy)
The majority of the zn
values are zero. That is, many suburbs have no residential land zoned for lots over 25,000 sq ft.
zn
zn.zero<-ifelse(zn==0,1,0)
table(zn.zero)
boxplot(medv~zn.zero)
boxplot(indus[zn.zero==1],indus[zn.zero==0],names=c("Zero Res. Land Zoned", "Some Res. Land Zoned"),main="Proportion of Non-Retail Business Acres")
The suburbs with no residential land zoned for lots over 25,000 sq ft have a much higher proportion of non-retail business acres. Large lots usually correspond to business parks, strip shopping centers, shopping malls, etc. It would make sense that areas with no zoned lots of this size would have a higher proportion of non-retail businesses. The variance of this group is also much higher, likely due to outliers.
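We can back up this visual impression with numeric summaries; a small sketch using the zn.zero indicator defined above:
tapply(indus, zn.zero, mean)  # mean proportion of non-retail acres in each group
tapply(indus, zn.zero, var)   # the group variances differ markedly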
The rm variable is real-valued (because it is an average). We create a discrete variable rm2 that rounds the number of rooms to the nearest integer. Let's check which values the rounded number of rooms takes and how often each of them occurs.
summary(rm)
rm2<-round(rm,0)
summary(rm2)
sort(unique(rm2))
table(rm2)
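As an illustrative sketch (treating rm2 simply as a grouping variable), a boxplot shows how the median home value rises with the rounded room count:
boxplot(medv ~ rm2, xlab = "Rounded number of rooms", ylab = "Median home value ($1000s)")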
Now we use graphs to illustrate the differences (for at least three variables) between towns with a pupil-teacher ratio of less than 20 and towns with a pupil-teacher ratio of greater than or equal to 20.
summary(ptratio)
pt.20<-ifelse(ptratio>=20,1,0)
table(pt.20)
m<-matrix(c(1,2,3,4,5,6),ncol=2)
layout(m)
#Crime Rate: very different distributions; best looked at with histograms
hist(crim[pt.20==1],xlab="Per Capita Crime Rate",main="Crime: PT Ratio >= 20")
hist(crim[pt.20==0],xlab="Per Capita Crime Rate",main="Crime: PT Ratio < 20")
# Proportion of Residential Land zoned:
#the huge frequency of zn = 0 makes the distribution hard to graph. Probably more useful to look at the
#table of the two indicator variables (zn = 0 yes/no vs. pt.ratio >= 20 yes/no)
barplot(table(zn.zero,pt.20),beside=T, names=c("Pt.Ratio >=20", "Pt.Ratio<20"),legend.text=c("None Zoned", "Some Zoned"),
main="Residential Zoning vs. PT Ratio")
#Proportion of Non-retail business acres per town: very different distributions; best looked at with histograms
hist(indus[pt.20==1],xlab="Proportion of Non-Retail Business Acres",main="Indus: Pt Ratio >=20")
hist(indus[pt.20==0],xlab="Proportion of Non-Retail Business Acres",main="Indus: Pt Ratio < 20")
#Charles River: dummy variable; again we look at the table of the two indicator variables
barplot(table(chas,pt.20),beside=T, names=c("Pt.Ratio >=20", "Pt.Ratio<20"),legend.text=c("Bounds River", "Doesn’t Bound River"),
main="Charles River vs. PT Ratio")
The per capita crime rate appears to be much higher in towns with a greater pupil-teacher ratio. The proportion of non-retail business acres is not distributed very differently once the actual frequencies are taken into account, apart from a very large peak around 15-20% in the suburbs with a large pupil-teacher ratio. Differences in the amount of residential land zoned are apparent in suburbs with a low pupil-teacher ratio. The differences between suburbs on the river and off the river do not appear to depend greatly on the pupil-teacher ratio.
par(mfrow=c(3,2))
#Nitrogen Oxides Concentration (parts per 10 million)
boxplot(nox[pt.20==1],nox[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="NOX vs. PT Ratio")
#Average number of rooms per dwelling
boxplot(rm[pt.20==1],rm[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="Avg # of Rooms vs. PT Ratio")
#Proportion of owner-occupied units built prior to 1940
boxplot(age[pt.20==1],age[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="# of Units built pre-1940 vs. PT Ratio")
#weighted mean of distances to five Boston employment centers
boxplot(dis[pt.20==1],dis[pt.20==0],names=c("Pt.Ratio >=20", "Pt.Ratio<20"),main="Dist to Employment Centers vs. PT Ratio")
# Index of Accessibility to radial highways: very different distributions but integer-valued
#better to use barplots but we want the ranges to be the same
rad1.freq <- rad2.freq <- rep(0, max(rad, na.rm = TRUE) - min(rad, na.rm = TRUE) + 1)
for (i in 1:nrow(Boston)) {
  if (!is.na(pt.20[i])) {
    if (pt.20[i] == 1) rad1.freq[rad[i]] <- rad1.freq[rad[i]] + 1
    if (pt.20[i] == 0) rad2.freq[rad[i]] <- rad2.freq[rad[i]] + 1
  }
}
barplot(rad1.freq,xlab="Index of Accessibility to Radial Highways",names=seq(1,24),col=3,main="Rad: PT Ratio >=20")
barplot(rad2.freq,xlab="Index of Accessibility to Radial Highways",names=seq(1,24),col=3,main="Rad: PT Ratio < 20")
In suburbs with a higher pupil-teacher ratio, the nitrogen oxides concentration appears to be higher. The average number of rooms does not seem to depend on the pupil-teacher ratio, though suburbs with a low pupil-teacher ratio perhaps have slightly larger houses. Suburbs with high pupil-teacher ratios have older houses (but note the large number of outliers) and are closer to employment centers. With regard to accessibility to radial highways, there appears to be a group of high pupil-teacher-ratio suburbs with better access. Otherwise, the shapes of the distributions are not that different (taking the frequencies into account).
The goal is to predict the median value of owner-occupied homes medv
based on the given predictors.
We start by creating a training set and fitting the tree to the training data.
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
Notice that the output of summary()
indicates that only three of the variables
have been used in constructing the tree. In the context of a regression
tree, the deviance is simply the sum of squared errors for the tree. We now
plot the tree.
plot(tree.boston)
text(tree.boston, pretty = 0)
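As a quick sanity check of the claim that the deviance is the sum of squared errors, we can compute the training RSS directly; a sketch using the objects fit above:
yhat.train <- predict(tree.boston, newdata = Boston[train, ])
sum((yhat.train - Boston$medv[train])^2)  # equals the tree's total deviance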
The variable lstat
measures the percentage of individuals with lower
socioeconomic status. The tree indicates that lower values of lstat
correspond
to more expensive houses. The tree predicts a median house price
of $\$46,380$ for larger homes in suburbs in which residents have high socioeconomic
status ($rm\geq 7.5$ and $lstat<9.715$).
Now we use the cv.tree()
function to see whether pruning the tree will
improve performance.
cv.boston <- cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = "b")
In this case, the most complex tree is selected by cross-validation. However,
if we wish to prune the tree, we could do so as follows, using the
prune.tree()
function:
prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)
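Rather than hard-coding best = 5, we could let the cross-validation results choose the size; a small sketch using cv.boston from above (prune.cv is a hypothetical name):
best.size <- cv.boston$size[which.min(cv.boston$dev)]  # size with the lowest CV deviance
prune.cv <- prune.tree(tree.boston, best = best.size)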
In keeping with the cross-validation results, we use the unpruned tree to make predictions on the test set.
yhat <- predict(tree.boston, newdata = Boston[-train, ])
boston.test <- Boston[-train, "medv"]
plot(yhat, boston.test)
abline(0, 1)
ssr.tree <- (yhat - boston.test)^2
round(mean((yhat - boston.test)^2), 3)
In other words, the test set MSE associated with the regression tree is $25.05$. The square root of the MSE is therefore around $5.005$, indicating that this model leads to test predictions that are within around $\$5,005$ of the true median home value for the suburb.
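The root mean squared error can be computed directly from the squared errors saved above:
sqrt(mean(ssr.tree))  # test-set RMSE of the single tree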
Here we apply bagging and random forests to the Boston
data, using the
randomForest
package in R
.
Recall that bagging is simply a special case of
a random forest with $m = p$. Therefore, the randomForest()
function can
be used to perform both random forests and bagging. We perform bagging as follows:
#install.packages("randomForest")
library (randomForest)
set.seed (1)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
bag.boston
The argument $mtry=13$ indicates that all $13$ predictors should be considered for each split of the tree—in other words, that bagging should be done. How well does this bagged model perform on the test set?
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
ssr.bag <- (yhat.bag - boston.test)^2
mean((yhat.bag - boston.test)^2)
The test set MSE associated with the bagged regression tree is $13.16$, almost
half that obtained using an optimally-pruned single tree. We could change
the number of trees grown by randomForest()
using the ntree
argument:
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, ntree = 25)
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)
Growing a random forest proceeds in exactly the same way, except that
we use a smaller value of the mtry
argument. By default, randomForest()
uses $p/3$ variables when building a random forest of regression trees. Here we
use $mtry = 6$.
set.seed(1)
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
ssr.rf <- (yhat.rf - boston.test)^2
mean((yhat.rf - boston.test)^2)
sqrt(mean(ssr.rf))  # RMSE of the random forest on the test set
The test set MSE is $11.31$; this indicates that random forests yielded an improvement over bagging in this case.
Using the importance()
function, we can view the importance of each
variable.
importance(rf.boston)
Two measures of variable importance are reported. The former is based upon the mean decrease in accuracy of predictions on the out-of-bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. In the case of regression trees, the node impurity is measured by the training RSS. Plots of these importance measures can be produced using the varImpPlot() function.
varImpPlot(rf.boston)
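To rank the predictors numerically we can sort the importance matrix; a sketch assuming the regression defaults, where the accuracy-based column is named "%IncMSE":
imp <- importance(rf.boston)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]  # most important variables first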
The results indicate that across all of the trees considered in the random
forest, the wealth level of the community (lstat
) and the house size (rm
)
are by far the two most important variables.
Now let's fit a linear model and compare the tree-based models with it.
# Fit the linear model on the training observations only, so that the
# test-set comparison is fair
lm.boston <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax +
                  ptratio + black + lstat, data = Boston, subset = train)
yhat.lm <- predict(lm.boston, newdata = Boston[-train, ])
ssr.lm <- (yhat.lm - boston.test)^2
mean(ssr.lm)
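Before plotting, it is convenient to line up the four test MSEs, using the squared-error vectors computed above:
c(Tree = mean(ssr.tree), Bag = mean(ssr.bag), RF = mean(ssr.rf), LM = mean(ssr.lm))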
# Compare the distributions of squared prediction errors on the test set
boxplot(ssr.tree, ssr.bag, ssr.rf, ssr.lm,
        names = c("Tree", "Bag", "RF", "LM"),
        ylim = c(0, 100), col = c("blue", "green", "green", "purple"))