#install.packages("psych")
#install.packages("gpairs")
#install.packages("gplots")
#install.packages("coefplot")
#install.packages("car")
#install.packages("corrplot")
#install.packages("alr3")
#install.packages("MASS")
#install.packages("mlogit")
library(MASS)
library(alr3)
require(car)
require("psych")
require(gpairs)
require(corrplot)
require(gplots)
require(coefplot)
setwd("C:\\Users\\armop\\Dropbox\\PHD\\Teaching\\AGBU505\\2019\\Lecture1") ##Start by defining directory
df=read.table("amus_park.csv",sep=",",header=TRUE)
After loading the data, start the usual procedure to check that the data were read correctly: check that the data frame has the expected dimensions, and print the first and last few observations.
dim(df)
head(df)
tail(df)
Check for NA (missing) values.
sum(is.na(df)) # count missing values in the data frame
It is a good idea to summarize your data and investigate it.
describe(df)
or you can use a different command:
summary(df)
For summarizing discrete data, use the function table().
table(df$weekend) #for discrete data
Before modeling, there are two important things to check: that each individual variable has a reasonable distribution, and that joint relationships among the variables are appropriate for modeling. Start by visualizing data.
#Visualize the data
pairs(formula = ~ weekend + num.child + distance + rides + games + wait + clean + overall, data=df, main="")
gpairs(df)
The difference between the two functions is in how they handle discrete variables.
To check the relationships among variables, we examine the bivariate scatterplots shown in the figure above. They show few concerns apart from the need to transform distance. For example, the pairwise scatterplots of our continuous measures are generally elliptical in shape, which is a good indication that they are appropriate to use in a linear model. One issue, however, is that the variables in the lower right of the figure are positively correlated with one another.
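The skew in distance that motivates the transformation is easy to confirm with a quick histogram of the raw and log-transformed variable (a simple check, not part of the original workflow):
hist(df$distance, main="", xlab="Distance")        # strongly right-skewed
hist(log(df$distance), main="", xlab="log(Distance)")  # much closer to symmetric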
df$logdist=log(df$distance)
gpairs(df[,-3])
head(df)
Although scatterplots provide a lot of visual information, when there are more than a few variables, it can be helpful to assess the relationship between each pair with a single number.
cor(df[,2:8])
corrplot.mixed(corr=cor(df[ , 2:8], use="complete.obs"),upper="ellipse", tl.pos="lt")
A correlation plot produced using corrplot.mixed() from the corrplot package is an easy way to visualize all of the correlations in the data. Numeric values of r are shown in the lower triangle of the matrix. The upper triangle displays ellipses (because we used the argument upper="ellipse"). These ellipses are tighter, progressively closer to being lines, for larger values of r, and are rounder, more like circles for r near zero. They are also shaded blue for positive direction, and red for negative (and show corresponding positive or negative slope).
The correlation coefficient r measures the linear association between two variables. If the relationship between two variables is not linear, it is misleading to interpret r.
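A tiny illustration of this point (with hypothetical variables x and y, unrelated to the park data): y is completely determined by x, yet r is essentially zero because the relationship is not linear.
x <- seq(-3, 3, length.out=100)
y <- x^2
cor(x, y)  # approximately 0, even though y is a deterministic function of x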
Now we are ready for model selection and validation process. We start with the Linear Model with a Single Predictor.
The goal of a satisfaction drivers analysis is to discover relationships between customers’ satisfaction with features of the service (or product) and their overall experience. For example, to what extent is satisfaction with the park’s rides related to overall experience? Is the relationship strong or weak?
Like most other R functions, lm() returns an object that we can save and use for other purposes. Typically, we assign the result of lm() to an object that is used in subsequent lines of code. For example, we can assign the result of lm() to a new object m1:
m1=lm(overall ~ rides, data=df)
In the output, R repeats the model for reference and reports two coefficients, which are the intercept and the slope of the fitted line. Those can be used to determine the best estimate for any respondent’s report of overall based on knowing his or her value for rides.
m1$coefficients
typeof(m1$coefficients)
For example, from this model we would expect that a customer who gives a rating of 90 for satisfaction with rides would give an overall rating of:
m1$coefficients[1]+m1$coefficients[2]*90
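The same prediction can be obtained with predict(), which is more convenient when scoring many new values at once (a quick alternative to the manual calculation above):
predict(m1, newdata=data.frame(rides=90))  # expected overall rating for rides=90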
plot(overall ~ rides, data=df, xlab="Satisfaction with Rides", ylab="Overall Satisfaction")
abline(m1, col="blue")
The result is shown in the figure. abline() recognizes that it is dealing with an lm object and uses the slope and the intercept from m1 to draw the line.
If you want more details about the output:
summary(m1)
t_value <- 1.7146/ 0.1085
print(t_value)
p_value <- 2*(1 - pt(abs(t_value), df=498))  # two-sided p-value, as reported by summary()
print(p_value)
alpha=0.05
conf_int_l <- 1.7146 + qt(alpha/2, 498)*0.1085      # qt(alpha/2, 498) is negative
conf_int_u <- 1.7146 + qt(1 - alpha/2, 498)*0.1085
round(conf_int_l,3)
round(conf_int_u,3)
round(confint(m1),3)
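Assuming 1.7146 and 0.1085 above are the rides coefficient and its standard error from summary(m1), the hand computations can be checked against R's own coefficient table:
summary(m1)$coefficients["rides", ]  # estimate, std. error, t value, Pr(>|t|)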
Checking Model Validity and Model Fit
par(mfrow=c(2,2))
plot(m1)
I do not recommend routinely removing outliers, but I do recommend inspecting them to determine whether there is a problem with the data. Let's inspect the identified points by selecting those rows:
df[c(57, 129, 295),]
In this case, none of the data points is obviously invalid (for instance, with values below 1 or greater than 100). Overall, the figure looks good and suggests that the model relating overall satisfaction to satisfaction with rides is reasonable.
Now that we’ve covered the basics of linear models using just one predictor, we turn to the problem of assessing multiple drivers of satisfaction. Our goal is to sort through all of the features of the park—rides, games, wait times, and cleanliness—to determine which ones are most closely related to overall satisfaction.
m2 <- lm(overall ~ rides + games + wait + clean, data=df)
summary(m2)
The R-squared increased to 0.5586, meaning that about half of the variation in overall ratings is explained by the ratings for specific features. The residual standard error is now 10.59, meaning that the predictions are more accurate. Our residuals also appear to be symmetric.
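The residual standard error quoted above can be pulled directly from the summary object, which is handy when comparing several models programmatically:
summary(m2)$sigma  # residual standard error of m2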
par(mfrow=c(2,2))
plot(m2)
Next we examine the model coefficients. Each coefficient represents the strength of the relationship between satisfaction with that feature and overall satisfaction, conditional on the values of the other predictors. All four features are identified as being statistically significant.
coefplot(m2, intercept=FALSE, outerCI=1.96, lwdOuter=1.5, ylab="Rating of Feature",xlab="Association with Overall Satisfaction")
We use coefplot() arguments to set the outer confidence interval to a width
of 1.96 standard errors (using outerCI=1.96, which corresponds to a 95%
confidence interval) and to increase the size of the plotted lines slightly with
lwdOuter=1.5.
The result is shown in the figure, where we see that satisfaction with cleanliness is estimated to be the most important feature associated with overall satisfaction, followed by satisfaction with the rides and wait times. Satisfaction with games is estimated to be relatively less important.
A plot of coefficients is often a key output from a satisfaction drivers analysis. If you have a large number of predictors, sorting the plot so that the coefficients appear in order of their estimated values may make it easier to quickly identify the features most closely related to overall satisfaction; a sketch of a sorted plot follows.
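If your version of the coefplot package supports it, the sort argument orders the coefficients by magnitude (a sketch; check ?coefplot for the exact option names available in your installation):
coefplot(m2, intercept=FALSE, sort="magnitude", ylab="Rating of Feature", xlab="Association with Overall Satisfaction")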
Now that we have two model objects, m1 and m2, we might ask which one is better. One way to evaluate models is to compare their adjusted R-squared values,
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
and another is the AIC criterion, given by $$AIC = -2\log L + k \times edf,$$ where $L$ is the likelihood function (in our case based on the normal distribution) and $edf$ is the equivalent degrees of freedom, i.e., the number of free parameters of the regression model; for OLS this is the number of estimated coefficients.
extractAIC(m1,k=2)
extractAIC(m2,k=2)
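Note that base R's AIC() is on a different additive scale than extractAIC() for lm objects (it includes constant terms of the normal log-likelihood), but the difference between models is the same, so either can be used to compare these fits:
AIC(m1)
AIC(m2)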
To compare the predictions of the models visually, we plot the fitted versus actual values for each:
plot(df$overall, fitted(m1), col="red",xlim=c(0,100), ylim=c(0,100), xlab="Actual Overall Satisfaction", ylab="Fitted Overall Satisfaction")
points(df$overall, fitted(m2), col="blue")
legend("topleft", legend=c("model 1", "model 2"),col=c("red", "blue"), pch=1)
If a model fit the data perfectly, its points would fall along a 45° line in this plot, but of course it is nearly impossible to fit customer satisfaction data perfectly. By comparing the red and the blue points in the resulting figure, you can see that the blue cloud of points is more tightly clustered along the diagonal, which shows that m2 explains more of the variation in the data than m1.
For a more formal test, which is possible because the models are nested, we can use the anova() function to determine whether m2 explains significantly more of the variation than m1:
anova(m1, m2)
The low p-value indicates that the additional predictors in m2 significantly improve the fit of the model. If these two models were the only ones under consideration, we would interpret m2 instead of m1.
Thus far, we have interpreted raw coefficients in order to evaluate the contributions
of ratings on the shared 100-point scale. However, if the variables have different
scales, such as a survey where rides is rated on a 1–10 scale while cleanliness is
rated on a 1–5 scale, then their coefficient values would not be directly comparable. In
the present data, this occurs with the distance and logdist variables, which
are not on a 100-point scale.
When you wish to compare coefficients, it can be helpful to standardize data on a
common scale before fitting a model (and after transforming any variables to a more
normal scale). The most common standardization converts values to zero-centered
units of standard deviation. This subtracts a variable’s mean from each observation
and then divides by the standard deviation (sd()). This could be done by:
df.std <- df[ , -3] # copy the data but drop distance (column 3)
df.std[ , 3:8] <- scale(df.std[ , 3:8])
head(df.std)
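As a check, scale() is equivalent to subtracting the mean and dividing by the standard deviation by hand; for example, for rides (a quick verification, not part of the modeling workflow):
head(cbind(scale(df$rides), (df$rides - mean(df$rides))/sd(df$rides)))  # the two columns match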
For the next step, we wonder whether satisfaction is different for customers who come on the weekend, travel farther, or have more children. We add these predictors to the model using the standardized data:
m3 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + num.child, data = df.std)
summary(m3)
When your data includes factors, you must be careful about the data type. For example,
num.child is a numeric variable, ranging 0–5, but it doesn’t necessarily
make sense to treat it as a number, as we did in m3.
In doing so, we implicitly assume
that satisfaction goes up or down linearly as a function of the number of children,
and that the effect is the same for each additional child. (Anyone who has taken a
group of children to an amusement park might guess that this is an unreasonable
assumption.)
We correct this by converting num.child to a factor and re-estimating the
model:
df.std$num.child.factor <- factor(df.std$num.child)
m4 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + num.child.factor, data=df.std)
summary(m4)
A striking thing about m4 is that the increase in overall satisfaction is about the
same regardless of how many children there are in the party—about one standard
deviation higher for any number of children. This suggests that we don’t actually
need to estimate a different increase for each number of children. In fact, if the
increase is the same for one child as for five children, attempting to fit a model that
scales increasingly per child would result in a less accurate estimate.
Instead, we declare a new variable called has.child that is TRUE when the party
has children in it and FALSE when the party does not have children. We then
estimate the model using that new factor variable.
We also drop weekend from the
model because it doesn’t seem to be a significant predictor:
df.std$has.child <- factor(df.std$num.child > 0)
m5 <- lm(overall ~ rides + games + wait + clean + logdist + has.child, data=df.std)
summary(m5)
We can include an interaction of two terms by using the : operator between variables in a formula. For instance, to estimate overall as a function of rides plus the interaction of wait and has.child, we could write the formula as overall ~ rides + wait:has.child.
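As a standalone illustration of that formula before fitting the full interaction model, we could fit it directly (m.int is a hypothetical model name, used only to demonstrate the : operator):
m.int <- lm(overall ~ rides + wait:has.child, data=df.std)
summary(m.int)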
m6 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + has.child + rides:has.child +
         games:has.child + wait:has.child + clean:has.child + rides:weekend + games:weekend +
         wait:weekend + clean:weekend, data=df.std)
summary(m6)
The model object m6 now includes eight interaction terms between the ratings for features of the park and has.child and weekend. Only one of these interactions is significant: the wait:has.child interaction.
This suggests we could drop the
non-significant interactions to create a new model m7:
m7 <- lm(overall ~ rides + games + wait + clean + logdist + has.child + wait:has.child, data=df.std)
summary(m7)
In these results, we see that attending the park with children is a predictor of higher satisfaction, and that waiting time is a more important predictor among those with children (wait:has.childTRUE) than among those without. We don't know the reason for this, but perhaps children go on more rides and their parents are therefore more influenced by wait times.
One might further tune the model by considering whether logdist is still needed; I'll leave it to you to verify that, and assume that model m7 is the final model.
What do we do with these results as marketers? We identify several possible marketing interventions.
There are many more such questions one could pose from results like these; a crucial step in analysis is to think carefully about the implications and where one might be able to make a product or market intervention.
It is important to remember that the model assesses association, not causation. Possible changes in outcome should be viewed as hypotheses suggested by the model, to be confirmed separately.
coefplot(m7, intercept=FALSE, outerCI=1.96, lwdOuter=1.5,ylab="Rating of Feature",xlab="Association with Overall Satisfaction")
Marginal model plots provide a convenient way to validate the functional form of the features:
mmps(m7,layout=c(2,3))
par(mfrow=c(1,1))
inverse.response.plot(m1,key=TRUE)
#Model=1
n <- length(m1$residuals)
npar <- length(m1$coefficients) +1
#Calculate adjusted R_Square
R_sq_m1<-summary(m1)$adj.r.squared
#Calculate AICc
AIC_m1<-extractAIC(m1,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m1<-extractAIC(m1,k=log(n))
#Model=2
npar <- length(m2$coefficients) +1
#Calculate adjusted R_Square
R_sq_m2<-summary(m2)$adj.r.squared
#Calculate AICc
AIC_m2<-extractAIC(m2,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m2<-extractAIC(m2,k=log(n))
#Model=5
npar <- length(m5$coefficients) +1
#Calculate adjusted R_Square
R_sq_m5<-summary(m5)$adj.r.squared
#Calculate AICc
AIC_m5<-extractAIC(m5,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m5<-extractAIC(m5,k=log(n))
#Model=6
npar <- length(m6$coefficients) +1
#Calculate adjusted R_Square
R_sq_m6<-summary(m6)$adj.r.squared
#Calculate AICc
AIC_m6<-extractAIC(m6,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m6<-extractAIC(m6,k=log(n))
#Model=7
npar <- length(m7$coefficients) +1
#Calculate adjusted R_Square
R_sq_m7<-summary(m7)$adj.r.squared
#Calculate AICc
AIC_m7<-extractAIC(m7,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m7<-extractAIC(m7,k=log(n))
R_square<-cbind(R_sq_m1,R_sq_m2,R_sq_m5,R_sq_m6,R_sq_m7)
AIC<-cbind(AIC_m1[2],AIC_m2[2],AIC_m5[2],AIC_m6[2],AIC_m7[2])
BIC<-cbind(BIC_m1[2],BIC_m2[2],BIC_m5[2],BIC_m6[2],BIC_m7[2])
model_comparison<-rbind(R_square,AIC,BIC)  # avoid masking the base function table()
model_comparison
Suppose now you want to choose the model based on how well it predicts. In the regression setting, the most commonly used measure is the mean squared error (MSE), given by $$MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat f(x_i))^2,$$ where $\hat f(x_i)$ is the prediction that $\hat f$ gives for the $i$th observation. The MSE will be small if the predicted responses are very close to the true responses, and large otherwise.
Important: if you compute this estimate on previously seen data (the training data), it will be comparatively small. But in general we do not really care how well the method works on the training data; rather, we are interested in the accuracy of the predictions when we apply our method to previously unseen test data. When your dataset is small and you do not have the luxury of dividing it into separate training and test sets, one recommended method is $K$-fold cross-validation.
In this case we randomly divide the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining $k-1$ folds. $MSE_1$ is then computed on the observations in the held-out fold. This procedure is repeated $k$ times; each time, a different group of observations is treated as the validation set. This process results in $k$ estimates of the error, $MSE_1, MSE_2, \dots, MSE_k$, and the $k$-fold CV estimate is computed by averaging these values: $$CV_{(k)}=\frac{1}{k}\sum_{i=1}^k MSE_i$$
dim(df)
MSE_m5<-deviance(m5)/nobs(m5)
print(MSE_m5)
MSE_m6<-deviance(m6)/nobs(m6)
print(MSE_m6)
MSE_m7<-deviance(m7)/nobs(m7)
print(MSE_m7)
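For an lm object, deviance() is the residual sum of squares and nobs() is the sample size, so the training MSE above is just the mean of the squared residuals; for example, for m7:
mean(residuals(m7)^2)  # should match MSE_m7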
cvlm <- function(n_folds=5, model, df)
{
  data <- df
  K <- n_folds                                   # number of folds (previously hard-coded to 5)
  form <- formula(model)
  mf <- model.frame(form, data=data)             # variables used by the model
  yresp <- model.response(mf)                    # observed response values
  fold.mse <- numeric(K)                         # per-fold mean squared errors
  # assign each row to a fold; folds follow row order, wrap in sample() to randomize
  fold_i <- rep(1:K, length.out=dim(data)[1])
  for (i in 1:K)
  {
    test_i <- which(fold_i==i)                   # rows in the held-out fold
    n <- length(test_i)
    train <- data[-test_i, ]
    test_x <- mf[test_i, ]
    model_i <- lm(form, data=train)              # refit on the remaining K-1 folds
    fold.mse[i] <- sum((yresp[test_i] - predict(model_i, newdata=test_x))^2)/n
  }
  MSE <- sum(fold.mse)/K                         # K-fold CV estimate
  return(list("MSE"=MSE, "SSE"=fold.mse))        # "SSE" holds the per-fold MSE values
}
MSE.m5<-cvlm(5,m5,df.std)$MSE
MSE.m6<-cvlm(5,m6,df.std)$MSE
MSE.m7<-cvlm(5,m7,df.std)$MSE
c(MSE.m5,MSE.m6,MSE.m7)
SSE.m5<-cvlm(5,m5,df.std)$SSE
SSE.m6<-cvlm(5,m6,df.std)$SSE
SSE.m7<-cvlm(5,m7,df.std)$SSE
boxplot(SSE.m5,SSE.m6,SSE.m7,names=c("M5","M6","M7"),col=c("blue","green","purple"))
We followed a lengthy process to arrive at the final model m7, and it is helpful to recount the general steps we recommend in creating such a linear model.
1. Inspect the data to make sure it is clean and has the structure you expect.
2. Check the distributions of the variables to make sure they are not highly skewed. If one is skewed, consider transforming it.
3. Examine the bivariate scatterplots and correlation matrix to see
whether there are any extremely correlated variables (such as r > 0.9, or several
with r > 0.8). If so, omit some variables or consider transforming them if
needed.
4. If you wish to estimate coefficients on a consistent scale, standardize the data
with scale() (Sect. 7.3.3).
5. After fitting a model, check the residual quantiles in the output. The residuals
show how well the model accounts for the individual observations (Sect. 7.2.4).
6. Check the standard model plots using plot(), which will help you judge
whether a linear model is appropriate or whether there is nonlinearity, and will
identify potential outliers in the data (Sect. 7.2.4).
7. Try several models and compare them for overall interpretability and model fit
by inspecting the residuals’ spread and overall R2 (Sect. 7.3.1). If the models
are nested, you could also use anova() for comparison (Sect. 6.5.1).
8. Report the confidence intervals of the estimates with your interpretation and
recommendations (Sect. 7.3).