In [2]:
#install.packages("psych")
#install.packages("gpairs")
#install.packages("gplots")
#install.packages("coefplot")
#install.packages("car")
#install.packages("corrplot")
#install.packages("alr3")
#install.packages("MASS")
#install.packages("mlogit")
library(MASS)
library(alr3)
require(car)
require("psych")
require(gpairs)
require(corrplot)
require(gplots)
require(coefplot)
In [2]:
setwd("C:\\Users\\armop\\Dropbox\\PHD\\Teaching\\AGBU505\\2019\\Lecture1") ##Start by defining directory
df <- read.table("amus_park.csv", sep=",", header=TRUE)

After loading the data, follow the usual procedure to check that it loaded correctly: verify that the data has the expected dimensions, and print the first and last few observations.

In [3]:
dim(df)
[1] 500   8
In [4]:
head(df)
  weekend num.child  distance rides games wait clean overall
1     yes         0 114.64826    82    68   64    86      49
2     yes         2  27.01410    82    73   80    84      68
3      no         1  63.30098    79    75   74    86      62
4     yes         0  25.90993    82    68   69    87      38
5      no         4  54.71831    78    83   78    84      69
6      no         5  22.67934    76    75   51    77      29
In [5]:
tail(df)
    weekend num.child distance rides games wait clean overall
495      no         5 41.47010    78    80   80    87      57
496      no         0 11.05258    85    67   72    87      48
497     yes         0  8.18774    86    79   85    89      48
498      no         2 45.17740    90    88   89    91      74
499      no         3 27.08838    78    79   84    86      57
500      no         1 38.40876    81    84   81    83      65

Check for NA (missing) values:

In [6]:
sum(is.na(df)) # check for NA or missing values
0

It is a good idea to summarize your data and investigate it.

In [7]:
describe(df)
          vars   n     mean         sd   median  trimmed      mad        min      max    range         skew   kurtosis         se
weekend*     1 500  1.48200  0.5001763  1.00000  1.47750  0.00000  1.0000000   2.0000   1.0000  0.071830669 -1.9988260 0.02236857
num.child    2 500  1.73800  1.4959384  2.00000  1.61000  1.48260  0.0000000   5.0000   5.0000  0.437142150 -0.7509303 0.06690040
distance     3 500 31.04751 33.1454452 19.01909 24.64603 17.25512  0.5267228 239.1921 238.6654  2.574233780  8.9959900 1.48230937
rides        4 500 80.38200  5.4150458 80.00000 80.35250  5.93040 67.0000000  94.0000  27.0000  0.060117097 -0.4848875 0.24216821
games        5 500 74.14400  8.1249797 74.00000 74.20000  8.89560 52.0000000  96.0000  44.0000 -0.049533738 -0.3451199 0.36336014
wait         6 500 73.41600 10.7518439 74.00000 73.50500 11.86080 43.0000000 103.0000  60.0000 -0.075266072 -0.2293318 0.48083708
clean        7 500 85.39200  5.1333262 85.00000 85.38750  5.93040 72.0000000  98.0000  26.0000  0.008375348 -0.4473279 0.22956933
overall      8 500 52.87200 16.0763339 52.00000 52.51000 16.30860  6.0000000 102.0000  96.0000  0.196187378 -0.1597183 0.71895551

or you can use a different command:

In [8]:
summary(df)
 weekend     num.child        distance            rides           games      
 no :259   Min.   :0.000   Min.   :  0.5267   Min.   :67.00   Min.   :52.00  
 yes:241   1st Qu.:0.000   1st Qu.: 10.3181   1st Qu.:77.00   1st Qu.:68.00  
           Median :2.000   Median : 19.0191   Median :80.00   Median :74.00  
           Mean   :1.738   Mean   : 31.0475   Mean   :80.38   Mean   :74.14  
           3rd Qu.:3.000   3rd Qu.: 39.5821   3rd Qu.:84.00   3rd Qu.:80.00  
           Max.   :5.000   Max.   :239.1921   Max.   :94.00   Max.   :96.00  
      wait            clean          overall      
 Min.   : 43.00   Min.   :72.00   Min.   :  6.00  
 1st Qu.: 65.75   1st Qu.:82.00   1st Qu.: 41.75  
 Median : 74.00   Median :85.00   Median : 52.00  
 Mean   : 73.42   Mean   :85.39   Mean   : 52.87  
 3rd Qu.: 81.00   3rd Qu.:89.00   3rd Qu.: 63.00  
 Max.   :103.00   Max.   :98.00   Max.   :102.00  

For summarizing discrete data, use the function table():

In [9]:
table(df$weekend) #for discrete data
 no yes 
259 241 
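
To see shares rather than counts, wrap the result in prop.table():

In [ ]:
prop.table(table(df$weekend)) # relative frequencies: no = 0.518, yes = 0.482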

Before modeling, there are two important things to check: that each individual variable has a reasonable distribution, and that joint relationships among the variables are appropriate for modeling. Start by visualizing the data.

In [10]:
#Visualize the data
pairs(formula = ~ weekend + num.child + distance + rides + games + wait + clean
      + overall, data=df, main="")
In [11]:
gpairs(df)
Loading required package: grid
Loading required package: lattice

The difference between the two functions above is in how they handle discrete variables.

To check the relationships among variables, we examine the bivariate scatterplots in the figure above. They show few concerns apart from the need to transform distance. For example, the pairwise scatterplots of our continuous measures are generally elliptical in shape, which is a good indication that they are appropriate to use in a linear model. One question, however, concerns the fact that the satisfaction variables in the lower right of the figure are positively correlated with one another.

In [12]:
df$logdist <- log(df$distance)
In [13]:
gpairs(df[,-3])
In [14]:
head(df)
  weekend num.child  distance rides games wait clean overall  logdist
1     yes         0 114.64826    82    68   64    86      49 4.741869
2     yes         2  27.01410    82    73   80    84      68 3.296359
3      no         1  63.30098    79    75   74    86      62 4.147901
4     yes         0  25.90993    82    68   69    87      38 3.254626
5      no         4  54.71831    78    83   78    84      69 4.002198
6      no         5  22.67934    76    75   51    77      29 3.121454

Although scatterplots provide a lot of visual information, when there are more than a few variables, it can be helpful to assess the relationship between each pair with a single number.

In [15]:
cor(df[,2:8])
            num.child     distance        rides        games         wait        clean    overall
num.child  1.00000000 -0.012136454 -0.027697383  0.012343448 -0.020994861 -0.007998140 0.34242021
distance  -0.01213645  1.000000000 -0.008598235 -0.008036711 -0.005848262  0.002919032 0.08829766
rides     -0.02769738 -0.008598235  1.000000000  0.453413358  0.315824769  0.789004748 0.57772686
games      0.01234345 -0.008036711  0.453413358  1.000000000  0.299620997  0.516460748 0.43325421
wait      -0.02099486 -0.005848262  0.315824769  0.299620997  1.000000000  0.366849504 0.56539527
clean     -0.00799814  0.002919032  0.789004748  0.516460748  0.366849504  1.000000000 0.62814674
overall    0.34242021  0.088297655  0.577726856  0.433254206  0.565395275  0.628146742 1.00000000
In [16]:
corrplot.mixed(corr=cor(df[ , 2:8], use="complete.obs"),upper="ellipse", tl.pos="lt")

A correlation plot produced using corrplot.mixed() from the corrplot package is an easy way to visualize all of the correlations in the data. Numeric values of r are shown in the lower triangle of the matrix. The upper triangle displays ellipses (because we used the argument upper="ellipse"). These ellipses are tighter, progressively closer to being lines, for larger values of r, and are rounder, more like circles for r near zero. They are also shaded blue for positive direction, and red for negative (and show corresponding positive or negative slope).

The correlation coefficient r measures the linear association between two variables. If the relationship between two variables is not linear, interpreting r can be misleading.
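
As a quick illustration, a perfect but non-linear (quadratic) relationship can produce a correlation near zero:

In [ ]:
x <- seq(-3, 3, length.out=100) # symmetric around zero
cor(x, x^2) # approximately 0, despite a perfect deterministic relationship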

Modeling

Now we are ready for the model selection and validation process. We start with a linear model with a single predictor.

The goal of a satisfaction drivers analysis is to discover relationships between customers’ satisfaction with features of the service (or product) and their overall experience. For example, to what extent is satisfaction with the park’s rides related to overall experience? Is the relationship strong or weak?

Like most other R functions, lm() returns an object that we can save and use for other purposes. Typically, we assign the result of lm() to an object that is used in subsequent lines of code. For example, we can assign the result of lm() to a new object m1:

In [17]:
m1 <- lm(overall ~ rides, data=df)

In the output, R repeats the model for reference and reports two coefficients, which are the intercept and the slope of the fitted line. Those can be used to determine the best estimate for any respondent’s report of overall based on knowing his or her value for rides.

In [18]:
m1$coefficients
(Intercept)       rides 
 -84.996879    1.715171 
In [19]:
typeof(m1$coefficients)
'double'

For example, from this model we would expect that a customer who gives a rating of 90 for satisfaction with rides would give an overall rating of:

In [20]:
m1$coefficients[1]+m1$coefficients[2]*90
(Intercept): 69.3685151719808
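
Equivalently, the built-in predict() function computes the same fitted value:

In [ ]:
predict(m1, newdata=data.frame(rides=90)) # same result, about 69.37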
In [21]:
plot(overall ~ rides, data=df, xlab="Satisfaction with Rides", ylab="Overall Satisfaction")
abline(m1, col="blue")

The result is shown in the figure above. abline() recognizes that it is dealing with an lm object, and uses the slope and the intercept from m1 to draw the line.

If you want more details about the output, use summary():

In [22]:
summary(m1)
Call:
lm(formula = overall ~ rides, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.926 -10.358   0.635   8.964  35.359 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -84.9969     8.7483  -9.716   <2e-16 ***
rides         1.7152     0.1086  15.795   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.14 on 498 degrees of freedom
Multiple R-squared:  0.3338,	Adjusted R-squared:  0.3324 
F-statistic: 249.5 on 1 and 498 DF,  p-value: < 2.2e-16
In [23]:
t_value <- 1.7146 / 0.1085 # reproduce the t statistic by hand: estimate / standard error (rounded values)
print(t_value)
[1] 15.80276
In [24]:
p_value <- 1 - pt(t_value, df=498) # upper-tail area; summary() reports the two-sided value, 2*(1 - pt(|t|, df))
print(p_value)
[1] 0
In [25]:
alpha <- 0.05
t_crit <- qt(1 - alpha/2, 498) # two-sided critical value, approximately 1.965
conf_int_l <- 1.7146 - t_crit*0.1085
conf_int_u <- 1.7146 + t_crit*0.1085
round(conf_int_l, 3)
round(conf_int_u, 3)
1.501
1.928
In [26]:
round(confint(m1),3)
               2.5 %  97.5 %
(Intercept) -102.185 -67.809
rides          1.502   1.929

Checking Model Validity and Model Fit

In [27]:
par(mfrow=c(2,2))
plot(m1)
  1. The first plot (in the upper left corner) shows the fitted values versus residuals for $m1$. There is no obvious pattern between the fitted values for overall satisfaction and the residuals; this is consistent with the idea that the residuals are due to random error, and supports the notion that the model is adequate.
  2. The second plot in the lower left of the figure is similar to the first, except that instead of plotting the raw residual value, it plots the square root of the standardized residual. Again, there should be no clear pattern; if there were it might indicate a nonlinear relationship. Observations with high residuals are flagged as potential outliers, and R labels them with row numbers in case we wish to inspect them in the data frame. A common pattern in residual plots is a cone or funnel, where the range of errors gets progressively larger for larger fitted values. This is called heteroskedasticity and is a violation of linear model assumptions. Sometimes a transformation of the predictor or outcome variable will resolve heteroskedasticity.
  3. The third result of plot() for lm objects is a Normal QQ plot. A QQ plot helps you see whether the residuals follow a normal distribution, another key assumption. It compares the values that residuals would be expected to take if they are normally distributed, versus their actual values. When the model is appropriate, these points are similar and fall close to a diagonal line; when the relationship between the variables is nonlinear or otherwise does not match the assumption, the points deviate from the diagonal line. In the present case, the QQ plot suggests that the data fits the assumption of the model.
  4. The final plot, in the lower right panel of the figure, again helps to identify potential outliers, observations that may come from a different distribution than the others. Outliers are a problem because, if they are far from other points, they unduly influence the fitted line; we do not want one or a very few observations to have a large effect on the coefficients. This panel plots the leverage of each point, a measure of how much influence the point has on the model coefficients. When a point has both a high residual and high leverage, it indicates that the point has both a different pattern (residual) and undue influence (leverage). One related measure is Cook's distance, an estimate of how much the predicted (y) values would change if the model were re-estimated with that point eliminated from the data. If you have observations with high Cook's distance, this chart shows dotted lines for the distances; in the present case, there are none. Still, three points are automatically labeled with row numbers because they are potentially problematic outliers based on high standardized residual distance and leverage on the model.
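
If you want to examine Cook's distance directly rather than read it off the plot, a minimal check is the following (the 4/n cutoff is a common rule of thumb, not a formal test):

In [ ]:
cooks <- cooks.distance(m1) # Cook's distance for each observation
which(cooks > 4/length(cooks)) # flag points above the informal 4/n cutoff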

Dealing with Outliers

I do not recommend routinely removing outliers, but I do recommend inspecting them to determine whether there is a problem with the data. Let's inspect the identified points by selecting those rows:

In [28]:
df[c(57, 129, 295),]
    weekend num.child distance rides games wait clean overall  logdist
57      yes         2 63.29248    92    83   93    98     102 4.147767
129     yes         0 11.89550    70    72   54    74       6 2.476161
295      no         0 11.74474    93    78   67    89      46 2.463406

In this case, none of the data points is obviously invalid (for instance, there are no negative or absurdly extreme ratings). Overall, the figure looks good and suggests that the model relating overall satisfaction to satisfaction with rides is reasonable.

Now that we’ve covered the basics of linear models using just one predictor, we turn to the problem of assessing multiple drivers of satisfaction. Our goal is to sort through all of the features of the park—rides, games, wait times, and cleanliness—to determine which ones are most closely related to overall satisfaction.

Including Variables

In [29]:
m2 <- lm(overall ~ rides + games + wait + clean, data=df)
summary(m2)
Call:
lm(formula = overall ~ rides + games + wait + clean, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.101  -7.266   1.274   7.582  28.695 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -125.16246    8.26476 -15.144  < 2e-16 ***
rides          0.53948    0.14753   3.657 0.000283 ***
games          0.16085    0.07115   2.261 0.024218 *  
wait           0.55497    0.04936  11.244  < 2e-16 ***
clean          0.96028    0.16394   5.858 8.57e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.92 on 495 degrees of freedom
Multiple R-squared:  0.5427,	Adjusted R-squared:  0.539 
F-statistic: 146.8 on 4 and 495 DF,  p-value: < 2.2e-16

The R-squared increased to 0.5427, meaning that about half of the variation in overall ratings is explained by the ratings for specific features. The residual standard error is now 10.92 (down from 13.14), meaning that the predictions are more accurate. The residuals also appear to be symmetric.

In [30]:
par(mfrow=c(2,2))
plot(m2)

Next we examine the model coefficients. Each coefficient represents the strength of the relationship between satisfaction with that feature and overall satisfaction, conditional on the values of the other predictors. All four features are identified as being statistically significant.

In [31]:
coefplot(m2, intercept=FALSE, outerCI=1.96, lwdOuter=1.5, ylab="Rating of Feature",xlab="Association with Overall Satisfaction")
Warning message:
"Ignoring unknown aesthetics: xmin, xmax"

We use coefplot() arguments to set the outer confidence interval to a width of 1.96 standard errors (using outerCI=1.96, which corresponds to a 95% confidence interval) and to increase the size of the plotted lines slightly with lwdOuter=1.5.
The result is shown in Figure, where we see that satisfaction with cleanliness is estimated to be the most important feature associated with overall satisfaction, followed by satisfaction with the rides and wait times. Satisfaction with games is estimated to be relatively less important.
A plot of coefficients is often a key output of a satisfaction drivers analysis. When you have a large number of predictors, sorting the plot by estimated coefficient can make it easier to quickly identify the features most closely related to overall satisfaction, as in the sketch below.
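
Assuming your installed version of the coefplot package supports the sort argument (available in recent versions), a magnitude-sorted plot would look like:

In [ ]:
# 'sort' is assumed available in recent versions of coefplot
coefplot(m2, intercept=FALSE, sort="magnitude")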

Now that we have two model objects, m1 and m2, we might ask which one is better. One way to evaluate models is to compare their adjusted R-squared values:

In [32]:
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
0.332430505598762
0.538966781918842

and the AIC criterion, which is given by: $$AIC = -2\log L + k \times edf ,$$ where $L$ is the likelihood function (in our case based on the normal distribution) and $edf$ is the equivalent degrees of freedom, that is, the number of free parameters of the regression model; for OLS this equals the number of estimated coefficients, including the intercept. Smaller AIC indicates a better trade-off between fit and complexity.
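
For lm objects, extractAIC() computes AIC from the residual sum of squares as $n \log(RSS/n) + k \times edf$ (which differs from AIC() by a model-independent constant); a quick hand check for m1:

In [ ]:
n <- nobs(m1)
n*log(deviance(m1)/n) + 2*length(coef(m1)) # matches extractAIC(m1, k=2)[2], about 2577.288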

In [33]:
extractAIC(m1,k=2)
extractAIC(m2,k=2)
[1]    2.000 2577.288
[1]    5.000 2395.180

To compare the predictions of the models visually, we plot the fitted versus actual values for each:

In [34]:
plot(df$overall, fitted(m1), col="red",xlim=c(0,100), ylim=c(0,100), xlab="Actual Overall Satisfaction", ylab="Fitted Overall Satisfaction")
points(df$overall, fitted(m2), col="blue")
legend("topleft", legend=c("model 1", "model 2"),col=c("red", "blue"), pch=1)

If a model fit the data perfectly, the points would fall along a 45° line in this plot, but, of course, it is nearly impossible to fit customer satisfaction data perfectly. By comparing the red and the blue points in the resulting plot, you can see that the blue cloud of points is more tightly clustered along the diagonal, which shows that m2 explains more of the variation in the data than m1.
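
To draw the 45° reference line on this plot, you can add:

In [ ]:
abline(0, 1, lty=2) # dashed reference line with intercept 0 and slope 1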

Comparing Nested Models

A more formal test is possible here because the models are nested: m1's predictors are a subset of m2's. We can use the anova() function to determine whether m2 explains significantly more of the variation than m1:

In [35]:
anova(m1, m2)
  Res.Df      RSS Df Sum of Sq        F       Pr(>F)
1    498 85921.11                                   
2    495 58980.91  3   26940.2 75.36562 3.641831e-40

The low p-value indicates that the additional predictors in m2 significantly improve the fit of the model. If these two models were the only ones under consideration, we would interpret m2 instead of m1.

Thus far, we have interpreted raw coefficients in order to evaluate the contributions of ratings on the shared 100-point scale. However, if the variables had different scales, such as a survey where rides is rated on a 1–10 scale while cleanliness is rated on a 1–5 scale, their coefficient values would not be directly comparable. In the present data, this occurs with the distance and logdist variables, which are not on a 100-point scale.

When you wish to compare coefficients, it can be helpful to standardize data on a common scale before fitting a model (and after transforming any variables to a more normal scale). The most common standardization converts values to zero-centered units of standard deviation. This subtracts a variable’s mean from each observation and then divides by the standard deviation (sd()). This could be done by:

In [36]:
df.std <- df[ , -3] # copy the data but drop distance (column 3); we use logdist instead
df.std[ , 3:8] <- scale(df.std[ , 3:8])
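
Manually, this is equivalent to subtracting each column's mean and dividing by its standard deviation; for example, for the first observation of rides:

In [ ]:
(df$rides[1] - mean(df$rides)) / sd(df$rides) # matches df.std$rides[1], about 0.299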
In [37]:
head(df.std)
  weekend num.child      rides      games        wait      clean    overall   logdist
1     yes         0  0.2987971 -0.7561865 -0.87575676  0.1184417 -0.2408509 1.7886823
2     yes         2  0.2987971 -0.1408004  0.61236008 -0.2711692  0.9410106 0.3226360
3      no         1 -0.2552148  0.1053541  0.05431626  0.1184417  0.5677911 1.1862757
4     yes         0  0.2987971 -0.7561865 -0.41072025  0.3132472 -0.9250865 0.2803106
5      no         4 -0.4398855  1.0899720  0.42634547 -0.2711692  1.0032138 1.0385034
6      no         5 -0.8092268  0.1053541 -2.08485169 -1.6348075 -1.4849157 0.1452467

Model 3

For the next step, we wonder whether satisfaction differs for customers who come on the weekend, travel farther, or have more children. We add these predictors to the model using the standardized data:

In [38]:
m3 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + num.child, data = df.std)
summary(m3)
Call:
lm(formula = overall ~ rides + games + wait + clean + weekend + 
    logdist + num.child, data = df.std)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.60258 -0.39978  0.02281  0.41053  1.70036 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.39447    0.04675  -8.438 3.63e-16 ***
rides        0.20641    0.04210   4.903 1.29e-06 ***
games        0.07134    0.03041   2.346   0.0194 *  
wait         0.37871    0.02792  13.566  < 2e-16 ***
clean        0.29033    0.04431   6.553 1.43e-10 ***
weekendyes  -0.04370    0.05167  -0.846   0.3981    
logdist      0.06557    0.02585   2.537   0.0115 *  
num.child    0.23909    0.01720  13.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5739 on 492 degrees of freedom
Multiple R-squared:  0.6752,	Adjusted R-squared:  0.6706 
F-statistic: 146.1 on 7 and 492 DF,  p-value: < 2.2e-16

When your data includes factors, you must be careful about data types. For example, num.child is a numeric variable, ranging from 0 to 5, but it doesn't necessarily make sense to treat it as a number, as we did in m3.

In doing so, we implicitly assume that satisfaction goes up or down linearly as a function of the number of children, and that the effect is the same for each additional child. (Anyone who has taken a group of children to an amusement park might guess that this is an unreasonable assumption.) We correct this by converting num.child to a factor and re-estimating the model:

In [39]:
df.std$num.child.factor <- factor(df.std$num.child)
In [40]:
m4 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + num.child.factor, data=df.std)
summary(m4)
Call:
lm(formula = overall ~ rides + games + wait + clean + weekend + 
    logdist + num.child.factor, data = df.std)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.25150 -0.32994 -0.00513  0.31409  1.44218 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.72670    0.04437 -16.378  < 2e-16 ***
rides              0.22188    0.03496   6.347 5.01e-10 ***
games              0.04299    0.02521   1.705   0.0888 .  
wait               0.38241    0.02312  16.537  < 2e-16 ***
clean              0.29755    0.03675   8.097 4.52e-15 ***
weekendyes        -0.01880    0.04273  -0.440   0.6602    
logdist            0.03154    0.02148   1.468   0.1426    
num.child.factor1  1.06124    0.07049  15.056  < 2e-16 ***
num.child.factor2  1.08519    0.05575  19.465  < 2e-16 ***
num.child.factor3  1.03091    0.06946  14.841  < 2e-16 ***
num.child.factor4  0.98893    0.07943  12.450  < 2e-16 ***
num.child.factor5  1.04293    0.10253  10.172  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4742 on 488 degrees of freedom
Multiple R-squared:  0.7801,	Adjusted R-squared:  0.7751 
F-statistic: 157.4 on 11 and 488 DF,  p-value: < 2.2e-16

A striking thing about m4 is that the increase in overall satisfaction is about the same regardless of how many children there are in the party—about one standard deviation higher for any number of children. This suggests that we don’t actually need to estimate a different increase for each number of children. In fact, if the increase is the same for one child as for five children, attempting to fit a model that scales increasingly per child would result in a less accurate estimate.

Instead, we declare a new variable called has.child that is TRUE when the party has children in it and FALSE when the party does not have children. We then estimate the model using that new factor variable.

We also drop weekend from the model because it doesn’t seem to be a significant predictor:

In [41]:
df.std$has.child <- factor(df.std$num.child > 0)
m5 <- lm(overall ~ rides + games + wait + clean + logdist + has.child, data=df.std)
summary(m5)
Call:
lm(formula = overall ~ rides + games + wait + clean + logdist + 
    has.child, data = df.std)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26183 -0.33269 -0.00197  0.32233  1.43605 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.73590    0.03861 -19.058  < 2e-16 ***
rides          0.22082    0.03466   6.370 4.33e-10 ***
games          0.04424    0.02509   1.763   0.0785 .  
wait           0.38335    0.02299  16.674  < 2e-16 ***
clean          0.29785    0.03648   8.164 2.73e-15 ***
logdist        0.03443    0.02124   1.621   0.1056    
has.childTRUE  1.05429    0.04629  22.774  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4727 on 493 degrees of freedom
Multiple R-squared:  0.7792,	Adjusted R-squared:  0.7765 
F-statistic:   290 on 6 and 493 DF,  p-value: < 2.2e-16

We can include an interaction of two terms by using the : operator between variables in a formula.

For instance, to estimate overall as a function of rides plus the interaction of wait and has.child, we could write the formula as overall ~ rides + wait:has.child.
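
Relatedly, the * operator is shorthand that includes both main effects along with their interaction; a quick sketch of the equivalence:

In [ ]:
# overall ~ wait*has.child expands to overall ~ wait + has.child + wait:has.child
coef(lm(overall ~ wait*has.child, data=df.std))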

In [42]:
m6 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + has.child +
         rides:has.child + games:has.child + wait:has.child + clean:has.child +
         rides:weekend + games:weekend + wait:weekend + clean:weekend, data=df.std)
In [43]:
summary(m6)
Call:
lm(formula = overall ~ rides + games + wait + clean + weekend + 
    logdist + has.child + rides:has.child + games:has.child + 
    wait:has.child + clean:has.child + rides:weekend + games:weekend + 
    wait:weekend + clean:weekend, data = df.std)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20612 -0.29683  0.00697  0.30788  1.45325 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.714619   0.042563 -16.790  < 2e-16 ***
rides                0.146551   0.067344   2.176 0.030026 *  
games                0.073642   0.048821   1.508 0.132103    
wait                 0.132772   0.043923   3.023 0.002637 ** 
clean                0.304566   0.078835   3.863 0.000127 ***
weekendyes          -0.015968   0.040826  -0.391 0.695880    
logdist              0.026443   0.020480   1.291 0.197261    
has.childTRUE        1.044456   0.044397  23.525  < 2e-16 ***
rides:has.childTRUE  0.054988   0.072208   0.762 0.446719    
games:has.childTRUE -0.057629   0.052324  -1.101 0.271275    
wait:has.childTRUE   0.341793   0.046910   7.286 1.31e-12 ***
clean:has.childTRUE  0.001811   0.078638   0.023 0.981640    
rides:weekendyes     0.072338   0.066973   1.080 0.280637    
games:weekendyes     0.021666   0.048577   0.446 0.655779    
wait:weekendyes      0.033464   0.044053   0.760 0.447846    
clean:weekendyes    -0.043026   0.070224  -0.613 0.540366    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4477 on 484 degrees of freedom
Multiple R-squared:  0.8056,	Adjusted R-squared:  0.7996 
F-statistic: 133.7 on 15 and 484 DF,  p-value: < 2.2e-16

The model object m6 now includes eight interaction terms between ratings for features of the park and has.child and weekend. Only one of these interactions is significant: the wait:has.child interaction.

This suggests we could drop the non-significant interactions to create a new model m7:

In [44]:
m7 <- lm(overall ~ rides + games + wait + clean + logdist + has.child + wait:has.child, data=df.std)
summary(m7)
Call:
lm(formula = overall ~ rides + games + wait + clean + logdist + 
    has.child + wait:has.child, data = df.std)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.18731 -0.30473 -0.00076  0.32601  1.41957 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.72769    0.03647 -19.952  < 2e-16 ***
rides               0.21439    0.03274   6.549 1.46e-10 ***
games               0.04875    0.02370   2.057   0.0402 *  
wait                0.15182    0.03674   4.133 4.21e-05 ***
clean               0.28936    0.03446   8.396 4.96e-16 ***
logdist             0.02890    0.02006   1.440   0.1504    
has.childTRUE       1.04753    0.04372  23.962  < 2e-16 ***
wait:has.childTRUE  0.33970    0.04348   7.813 3.42e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4463 on 492 degrees of freedom
Multiple R-squared:  0.8036,	Adjusted R-squared:  0.8008 
F-statistic: 287.6 on 7 and 492 DF,  p-value: < 2.2e-16

In these results, we see that attending the park with children is a predictor of higher satisfaction, and that waiting time is a more important predictor among those with children (wait:has.childTRUE) than among those without. We don't know the reason for this, but perhaps children go on more rides and their parents are therefore more influenced by wait times.

One might further tune the model by considering whether logdist is still needed; I'll leave it to you to verify that (a sketch of the check follows) and assume that model m7 is the final model.
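
As a sketch of that check (m7b is a scratch name, not part of the analysis above), compare the nested models with and without logdist:

In [ ]:
m7b <- lm(overall ~ rides + games + wait + clean + has.child + wait:has.child, data=df.std)
anova(m7b, m7) # does adding logdist significantly improve the fit?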

What to do with results?

What do we do with these results as marketers? We identify several possible marketing interventions.

  1. If we want to increase satisfaction overall, we could perhaps do so by trying to increase the number of visitors with children.
  2. Alternatively, if we want to appeal to visitors without children, we might engage in further research to understand why their ratings are lower.
  3. If we are allocating budget to personnel, the importance of cleanliness suggests continuing to allocate resources there (as opposed, say, to games).
  4. We might also want to learn more about the association between children and waiting time, and whether there are things we could do to make waiting less frequent or more enjoyable.

There are many more such questions one could pose from results like these; a crucial step in analysis is to think carefully about the implications and where one might be able to make a product or market intervention.

It is important to remember that the model assesses association, not causation. Possible changes in outcomes should be viewed as hypotheses suggested by the model, to be confirmed separately.

In [45]:
coefplot(m7, intercept=FALSE, outerCI=1.96, lwdOuter=1.5,ylab="Rating of Feature",xlab="Association with Overall Satisfaction")
Warning message:
"Ignoring unknown aesthetics: xmin, xmax"

Marginal Plot

Marginal model plots provide a convenient way to validate the functional form of the predictors.

In [46]:
mmps(m7,layout=c(2,3))
Warning message in mmps(m7, layout = c(2, 3)):
"Interactions and/or factors skipped"

Inverse Response Plot

In [47]:
par(mfrow=c(1,1))
inverse.response.plot(m1,key=TRUE)
Warning message:
"'inverse.response.plot' is deprecated.
Use 'inverseResponsePlot' instead.
See help("Deprecated") and help("alr3-deprecated")."
     lambda      RSS
1  1.037404 28676.85
2 -1.000000 35249.13
3  0.000000 29794.01
4  1.000000 28677.74

The best-fitting power transformation of the response is λ ≈ 1.04, with an RSS nearly identical to that for λ = 1, which suggests that no transformation of overall is needed.
1.00000028677.74
In [48]:
#Model=1
n <- length(m1$residuals)
npar <- length(m1$coefficients) + 1 # number of parameters, +1 for the error variance
#Calculate adjusted R_Square
R_sq_m1<-summary(m1)$adj.r.squared
#Calculate AICc
AIC_m1<-extractAIC(m1,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m1<-extractAIC(m1,k=log(n))
#Model=2
npar <- length(m2$coefficients) +1
#Calculate adjusted R_Square
R_sq_m2<-summary(m2)$adj.r.squared
#Calculate AICc
AIC_m2<-extractAIC(m2,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m2<-extractAIC(m2,k=log(n))
#Model=5
npar <- length(m5$coefficients) +1
#Calculate adjusted R_Square
R_sq_m5<-summary(m5)$adj.r.squared
#Calculate AICc
AIC_m5<-extractAIC(m5,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m5<-extractAIC(m5,k=log(n))
#Model=6
npar <- length(m6$coefficients) +1
#Calculate adjusted R_Square
R_sq_m6<-summary(m6)$adj.r.squared
#Calculate AICc
AIC_m6<-extractAIC(m6,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m6<-extractAIC(m6,k=log(n))
#Model=7
npar <- length(m7$coefficients) +1
#Calculate adjusted R_Square
R_sq_m7<-summary(m7)$adj.r.squared
#Calculate AICc
AIC_m7<-extractAIC(m7,k=2)+2*npar*(npar+1)/(n-npar-1)
#Calculate BIC
BIC_m7<-extractAIC(m7,k=log(n))
In [49]:
R_square<-cbind(R_sq_m1,R_sq_m2,R_sq_m5,R_sq_m6,R_sq_m7)
AIC<-cbind(AIC_m1[2],AIC_m2[2],AIC_m5[2],AIC_m6[2],AIC_m7[2])
BIC<-cbind(BIC_m1[2],BIC_m2[2],BIC_m5[2],BIC_m6[2],BIC_m7[2])
table<-rbind(R_square,AIC,BIC) # note: this name masks base::table() for the rest of the session
In [50]:
table
              R_sq_m1      R_sq_m2      R_sq_m5     R_sq_m6      R_sq_m7
R_square    0.3324305    0.5389668    0.7765439    0.799596    0.8008016
AIC      2577.3367342 2395.3508756 -742.0264129 -786.701850 -798.4242080
BIC      2585.7175633 2416.2535307 -712.8174352 -720.537830 -765.0746901

Measuring the quality of fit

Suppose now you want to choose the model based on how well it predicts. In the regression setting, the most commonly used measure is the mean squared error (MSE), given by $$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat f(x_i))^2,$$ where $\hat f(x_i)$ is the prediction that $\hat f$ gives for the $i$th observation. The MSE is small when the predicted responses are close to the true responses, and large otherwise.
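
For a fitted lm object, the training-set MSE from this formula is just the mean of the squared residuals; it equals deviance(model)/nobs(model), the form used below:

In [ ]:
mean(residuals(m1)^2) # training MSE for m1; identical to deviance(m1)/nobs(m1)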

Important: if you compute this estimate on previously seen data (the training data), it will be comparatively small. But in general, we do not really care how well the method works on the training data; rather, we are interested in the accuracy of the predictions when we apply our method to previously unseen test data. When your dataset is small, you may not have the luxury of dividing it into two parts. One recommended method in that case is $K$-fold cross-validation.

Cross Validation

In this case we randomly divide the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining $k-1$ folds. The $MSE_1$ is then computed on the observations in the held-out fold. This procedure is repeated $k$ times; each time, a different group of observations is treated as the validation set. This process results in $k$ estimates of the error, $MSE_1, MSE_2, \dots, MSE_k$. The $k$-fold CV estimate is computed by averaging these values: $$CV_{(k)}=\frac{1}{k}\sum_{i=1}^k MSE_i$$

In [51]:
dim(df)
MSE_m5<-deviance(m5)/nobs(m5)
print(MSE_m5)
MSE_m6<-deviance(m6)/nobs(m6)
print(MSE_m6)
MSE_m7<-deviance(m7)/nobs(m7)
print(MSE_m7)
[1] 500   9
[1] 0.2203277
[1] 0.1939911
[1] 0.1960113
In [68]:
cvlm <- function(n_folds=5, model, df)
    {
    data <- df
    K <- n_folds
    form <- formula(model)                    # reuse the model's formula on each fold
    # assign observations to folds cyclically; wrap in sample() if you
    # want the random assignment described above
    fold_i <- rep(1:K, length.out=nrow(data))
    sum.sq <- numeric(K)                      # per-fold MSE
    for (i in 1:K)
    {
      test_i  <- which(fold_i == i)           # hold out fold i
      test    <- data[test_i, ]
      train   <- data[-test_i, ]
      model_i <- lm(form, data=train)         # fit on the remaining K-1 folds
      sum.sq[i] <- sum((test$overall - predict(model_i, newdata=test))^2) / length(test_i)
    }
    list("MSE"=mean(sum.sq), "SSE"=sum.sq)    # CV estimate and the per-fold MSEs
    }
In [69]:
MSE.m5<-cvlm(5,m5,df.std)$MSE
MSE.m6<-cvlm(5,m6,df.std)$MSE
MSE.m7<-cvlm(5,m7,df.std)$MSE
c(MSE.m5,MSE.m6,MSE.m7)
[1] 0.2267831 0.2072198 0.2024446
In [70]:
SSE.m5<-cvlm(5,m5,df.std)$SSE
SSE.m6<-cvlm(5,m6,df.std)$SSE
SSE.m7<-cvlm(5,m7,df.std)$SSE
In [72]:
boxplot(SSE.m5,SSE.m6,SSE.m7,names=c("M5","M6","M7"),col=c("blue","green","purple")) # distribution of per-fold MSEs for each model

We followed a lengthy process to arrive at the final model m7, and it is helpful to recount the general steps we recommend in creating such a linear model.

1. Inspect the data to make sure it is clean and has the structure you expect.
2. Check the distributions of the variables to make sure they are not highly skewed. If one is skewed, consider transforming it.
3. Examine the bivariate scatterplots and correlation matrix to see whether there are any extremely correlated variables (such as r > 0.9, or several with r > 0.8). If so, omit some variables or consider transforming them if needed.
4. If you wish to estimate coefficients on a consistent scale, standardize the data with scale() (Sect. 7.3.3).
5. After fitting a model, check the residual quantiles in the output. The residuals show how well the model accounts for the individual observations (Sect. 7.2.4).
6. Check the standard model plots using plot(), which will help you judge whether a linear model is appropriate or whether there is nonlinearity, and will identify potential outliers in the data (Sect. 7.2.4).
7. Try several models and compare them for overall interpretability and model fit by inspecting the residuals' spread and overall R² (Sect. 7.3.1). If the models are nested, you could also use anova() for comparison (Sect. 6.5.1).
8. Report the confidence intervals of the estimates with your interpretation and recommendations (Sect. 7.3).