Conjoint Analysis

In [1]:
install.packages("mlogit")
package 'mlogit' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\armop\AppData\Local\Temp\RtmpOUji0k\downloaded_packages
In [2]:
library("mlogit")
Loading required package: Formula
Loading required package: maxLik
Loading required package: miscTools

Please cite the 'maxLik' package as:
Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.

If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
https://r-forge.r-project.org/projects/maxlik/
In [3]:
getwd()
'C:/Users/armop/Dropbox/PHD/Teaching/AGBU505/2019/Supervised Learning/Lecture4'
In [4]:
setwd()
Error in setwd(): argument "dir" is missing, with no default
Traceback:

1. setwd()
In [5]:
cbc.df <- read.csv("C:\\Users\\armop\\Dropbox\\PHD\\Teaching\\AGBU505\\2019\\Homework 5\\conjoint_yogurt.csv", colClasses = c(flav="factor",size="factor",diet = "factor", price = "factor"))
In [6]:
head(cbc.df)
  resp.id ques alt diet size  shp flav price choice
1       1    1   1   no    6 crcl cher    90      0
2       1    1   2   no    6 crcl  van    90      0
3       1    1   3   no    5  sqr  van    40      1
4       1    2   1   no    8 crcl peac    60      0
5       1    2   2   no    5 crcl peac    90      0
6       1    2   3   no    5 crcl cher    40      1

The first three rows in cbc.df describe the first question that was asked of respondent 1, which is the question shown in the figure in the lecture notes.

  1. The choice column shows that this respondent chose the third alternative, which in this data set was a non-diet vanilla yogurt in a square, size-5 container at the 40 price level.

  2. resp.id indicates which respondent answered this question.

  3. ques indicates that these first three rows were the profiles in the first question.
  4. alt indicates that the first row was alternative 1, the second was alternative 2, and the third was alternative 3.
  5. choice indicates which alternative the respondent chose; it takes the value 1 for the profile in each choice question that was indicated as the preferred alternative.
In [11]:
summary(cbc.df)
    resp.id            ques         alt    carpool    seat     cargo     
 Min.   :  1.00   Min.   : 1   Min.   :1   no :6345   6:3024   2ft:4501  
 1st Qu.: 50.75   1st Qu.: 4   1st Qu.:1   yes:2655   7:2993   3ft:4499  
 Median :100.50   Median : 8   Median :2              8:2983             
 Mean   :100.50   Mean   : 8   Mean   :2                                 
 3rd Qu.:150.25   3rd Qu.:12   3rd Qu.:3                                 
 Max.   :200.00   Max.   :15   Max.   :3                                 
   eng       price         choice      
 elec:3010   30:2998   Min.   :0.0000  
 gas :3005   35:2997   1st Qu.:0.0000  
 hyb :2985   40:3005   Median :0.0000  
                       Mean   :0.3333  
                       3rd Qu.:1.0000  
                       Max.   :1.0000  

However, a more informative way to summarize choice data is to compute choice counts, which are cross tabs of the number of times respondents chose an alternative at each feature level. We can do this easily using xtabs():

In [12]:
xtabs(choice ~ price, data=cbc.df)
price
  30   35   40 
1486  956  558 
In [13]:
xtabs(choice ~ cargo, data=cbc.df)
cargo
 2ft  3ft 
1312 1688 

You should compute choice counts for each attribute before estimating a choice model. If you find that your model’s estimates or predicted shares are not consistent with the raw counts, consider whether there could be a mistake in the data formatting.
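For instance, the same counts can be computed for the remaining attributes; a quick sketch, using the column names shown in head(cbc.df) above:

# choice counts for the other attributes in cbc.df
xtabs(choice ~ size, data = cbc.df)
xtabs(choice ~ shp, data = cbc.df)
xtabs(choice ~ flav, data = cbc.df)
xtabs(choice ~ diet, data = cbc.df)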

Fitting a Choice Model Using mlogit

mlogit requires the choice data to be in a special data format created using the mlogit.data() function. You pass your choice data to mlogit.data, along with a few parameters telling it how the data is organized. mlogit.data accepts data in either a “long” or a “wide” format and you tell it which you have using the shape parameter. The choice, varying and id.var parameters indicate which columns contain the response data, the attributes and the respondent ids, respectively.

In [7]:
cbc.mlogit <- mlogit.data(data=cbc.df, choice="choice", shape="long",varying=3:6, alt.levels=paste("pos",1:3),id.var="resp.id")
In [8]:
head(cbc.mlogit)
        resp.id ques alt diet size  shp flav price choice
1.pos 1       1    1   1   no    6 crcl cher    90  FALSE
1.pos 2       1    1   2   no    6 crcl  van    90  FALSE
1.pos 3       1    1   3   no    5  sqr  van    40   TRUE
2.pos 1       1    2   1   no    8 crcl peac    60  FALSE
2.pos 2       1    2   2   no    5 crcl peac    90  FALSE
2.pos 3       1    2   3   no    5 crcl cher    40   TRUE
In [18]:
m1 <- mlogit(choice ~ 0 + size + shp + flav + price, data = cbc.mlogit)
In [11]:
summary(m1)
Call:
mlogit(formula = choice ~ 0 + size + shp + flav + price, data = cbc.mlogit, 
    method = "nr", print.level = 0)

Frequencies of alternatives:
pos 1 pos 2 pos 3 
0.325 0.347 0.328 

nr method
5 iterations, 0h:0m:0s 
g'(-H)^-1g = 5.1E-05 
successive function values within tolerance limits 

Coefficients :
          Estimate Std. Error  t-value  Pr(>|t|)    
size6    -0.395743   0.062733  -6.3084  2.82e-10 ***
size8    -0.153658   0.061339  -2.5051   0.01224 *  
shpsqr   -0.520698   0.050971 -10.2156 < 2.2e-16 ***
flavpeac  0.784948   0.066954  11.7237 < 2.2e-16 ***
flavvan   1.601449   0.068335  23.4354 < 2.2e-16 ***
price60  -0.796060   0.059407 -13.4001 < 2.2e-16 ***
price90  -1.621879   0.067511 -24.0239 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -2597.6

The Estimate column lists the mean value for each level; these must be interpreted relative to the base level of each attribute. For example, the estimate for size6 measures the attractiveness of size-6 yogurts relative to the base size of 5. The negative sign tells us that, on average, respondents preferred the size-5 yogurts to the size-6 yogurts. Estimates that are larger in magnitude indicate stronger preferences, so we can see that customers strongly preferred vanilla flavor (relative to the base level, which is cherry) and disliked the 90 price level (relative to the base price of 40). These parameter estimates are on the logit scale and typically range between −2 and 2.
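To see which level serves as the base for each attribute, you can inspect the factor levels; the first level of each factor is the base. A quick sketch, assuming the attribute columns are factors (as set up in the read.csv call above; under R's pre-4.0 default stringsAsFactors = TRUE, shp is a factor as well):

# the first level of each factor is the base (reference) level in the model
levels(cbc.df$size)
levels(cbc.df$shp)
levels(cbc.df$flav)
levels(cbc.df$price)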

To convert the coefficients into odds ratios, we take the exponential of the coefficients:

In [18]:
round(exp(coef(m1)),3)
   seat7    seat8 cargo3ft   enggas   enghyb  price35  price40 
   0.586    0.737    1.612    4.622    2.053    0.401    0.178 

Based on the odds interpretation, the exponentiated coefficient of $cargo3ft = 1.612$ means that customers are 1.612 times more likely to choose a minivan with 3 ft of cargo space than one with 2 ft. Another way to think about this is that the 3 ft cargo option increases the odds of choice by $61.2\%$.
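In general, a coefficient $\beta$ on the logit scale corresponds to an odds ratio of $e^{\beta}$, i.e. a $(e^{\beta} - 1) \times 100\%$ change in the odds of choice; here $e^{0.478} \approx 1.612$, which is where the $61.2\%$ figure comes from.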

The Std. Error column gives a sense of how precise the estimate is, given the data, and comes with a statistical test of whether the coefficient is different from zero. A non-significant test result indicates that there is no detectable difference in preference for that level relative to the base level. Just as with any statistical model, the more data you have in your conjoint study (for a given set of attributes), the smaller the standard errors will be.
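The estimates and standard errors can also be turned into approximate 95% confidence intervals; a minimal sketch using the Wald approximation (confint(m1) should give essentially the same intervals):

# Wald-style 95% confidence intervals from the coefficients and their standard errors
est <- coef(m1)
se <- sqrt(diag(vcov(m1)))
round(cbind(lower = est - 1.96 * se, estimate = est, upper = est + 1.96 * se), 3)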

A good question is why we included $0 +$ in the formula for $m1$; it indicates that we did not want an intercept included in our model. We could also estimate a model with an intercept:

In [19]:
m2 <- mlogit(choice ~ seat + cargo + eng + price, data = cbc.mlogit)
In [20]:
summary(m2)
Call:
mlogit(formula = choice ~ seat + cargo + eng + price, data = cbc.mlogit, 
    method = "nr", print.level = 0)

Frequencies of alternatives:
  pos 1   pos 2   pos 3 
0.32700 0.33467 0.33833 

nr method
5 iterations, 0h:0m:0s 
g'(-H)^-1g = 7.86E-05 
successive function values within tolerance limits 

Coefficients :
                   Estimate Std. Error  t-value  Pr(>|t|)    
pos 2:(intercept)  0.028980   0.051277   0.5652    0.5720    
pos 3:(intercept)  0.041271   0.051384   0.8032    0.4219    
seat7             -0.535369   0.062369  -8.5840 < 2.2e-16 ***
seat8             -0.304369   0.061164  -4.9763 6.481e-07 ***
cargo3ft           0.477705   0.050899   9.3854 < 2.2e-16 ***
enggas             1.529423   0.067471  22.6677 < 2.2e-16 ***
enghyb             0.717929   0.065554  10.9517 < 2.2e-16 ***
price35           -0.913777   0.060608 -15.0769 < 2.2e-16 ***
price40           -1.726878   0.069654 -24.7922 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -2581.3
McFadden R^2:  0.21674 
Likelihood ratio test : chisq = 1428.5 (p.value = < 2.22e-16)

When we include the intercept, mlogit adds two additional parameters that indicate preference for the different positions in the question (left, right, or middle in the survey figure from the lecture notes):

pos 2:(intercept) indicates the relative preference of the second position in the question (versus the first) and pos 3:(intercept) indicates the preference for the third position (versus the first). These are sometimes called alternative specific constants, or ASCs, to differentiate them from the single intercept in a linear model.

In a typical conjoint analysis study, we don’t expect that people will choose a minivan because it is on the left or the right in a survey question! For that reason, we would not expect the estimated alternative specific constants to differ from zero. If we found one of these parameters to be significant, that might indicate that some respondents are simply choosing the first or the last option without considering the question.

In this model, the intercept parameter estimates are non-significant and close to zero. This suggests that it was reasonable to leave them out of our first model, but we can test this formally using lrtest():

In [21]:
lrtest(m1, m2)
 #Df    LogLik Df     Chisq Pr(>Chisq)
   7 -2581.602 NA        NA         NA
   9 -2581.263  2 0.6789101  0.7121583

This function performs a statistical test called a likelihood ratio test, which can be used to compare two choice models where one model has a subset of the parameters of the other. Comparing m1 to m2 results in a p-value, Pr(>Chisq), of $0.7122$. Since the p-value is much greater than $0.05$, we can conclude that m1 and m2 fit the data equally well. This suggests that we don't need the alternative specific constants to fit the present data.
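For intuition, the same test can be computed by hand from the two log-likelihoods; a short sketch (the 2 degrees of freedom are the 9 − 7 extra parameters in m2):

# likelihood ratio test by hand: twice the difference in log-likelihood,
# compared against a chi-squared distribution with 2 degrees of freedom
chisq <- as.numeric(2 * (logLik(m2) - logLik(m1)))
pchisq(chisq, df = 2, lower.tail = FALSE)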

We don’t have to treat every attribute in a conjoint study as a factor. As with linear models, some predictors may be factors while others are numeric. For example, we can include price as a numeric predictor with a simple change to the model formula.

In the model formula, we convert price to a character vector using as.character() and then to a number using as.numeric().

In [26]:
m3 <- mlogit(choice ~ diet + as.numeric(as.character(price)), data = cbc.mlogit)
Error in solve.default(H, g[!fixed]): Lapack routine dgesv: system is exactly singular: U[3,3] = 0
Traceback:

1. mlogit(choice ~ diet + as.numeric(as.character(price)), data = cbc.mlogit)
2. eval(opt, sys.frame(which = nframe))
3. eval(opt, sys.frame(which = nframe))
4. mlogit.optim(method = "nr", print.level = 0, start = structure(c(0, 
 . 0, 0, 0), .Names = c("pos 2:(intercept)", "pos 3:(intercept)", 
 . "dietyes", "as.numeric(as.character(price))")), logLik = lnl.slogit, 
 .     weights = weights, opposite = opposite, X = Xl, y = yl)
5. as.vector(solve(H, g[!fixed]))
6. solve(H, g[!fixed])
7. solve.default(H, g[!fixed])
In [27]:
table(cbc.mlogit$diet)
  no  yes 
5490 3510 
In [23]:
cbc.mlogit$diet=as.factor(cbc.mlogit$diet)
In [24]:
levels(cbc.mlogit$diet)
  1. 'no'
  2. 'yes'
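The successful refit of m3 is not shown in this transcript. A sketch of one plausible specification, assuming it mirrors m1's attributes without an intercept and enters price as a number (the exact formula used is not recorded here; the likelihood ratio test below only confirms that m3 ended up with one fewer parameter than m1):

# assumed specification for m3: same attributes as m1, but price entered as a numeric variable
m3 <- mlogit(choice ~ 0 + size + shp + flav + as.numeric(as.character(price)),
             data = cbc.mlogit)
summary(m3)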

With price entered as a number, the model has a single parameter for price. The estimate is negative, indicating that people prefer lower prices to higher prices. A quick likelihood ratio test suggests that the model with a single price parameter fits just as well as our first model:

In [24]:
lrtest(m1, m3)
 #Df    LogLik Df    Chisq Pr(>Chisq)
   7 -2581.602 NA       NA         NA
   6 -2582.055 -1 0.905363  0.3413478

Given this finding, we choose $m3$ as our preferred model because it has fewer parameters.

Reporting Choice Model Findings

Because the coefficients measure relative preference for the levels, they can be difficult to understand and interpret on their own. So, instead of presenting the coefficients, most choice modelers prefer to use the model to make choice share predictions or to compute the willingness-to-pay for each attribute.

Willingness-to-Pay

We can compute the average willingness-to-pay for a particular level of an attribute by dividing the coefficient for that level by the negative of the price coefficient:

In [25]:
coef(m3)["cargo3ft"]/(-coef(m3)["as.numeric(as.character(price))"]/1000)
cargo3ft: 2750.60110180747

The result is a number measured in dollars, $\$2750.60$ in this case. (We divide by $1000$ because our prices were recorded in $1,000$s of dollars.)


Willingness-to-pay is a bit of a misnomer; the proper interpretation of this number is that, on average, customers would be equally divided between a minivan with 2 ft of cargo space and a minivan with 3 ft of cargo space that costs $\$2750.60$ more. Another way to think of it is that $\$2750.60$ is the price at which customers become indifferent between the two cargo capacity options.
You can compute the willingness-to-pay value for every attribute level in the study and report these to decision makers to help them understand how much customers value various features.
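A sketch of that computation for all of the non-price coefficients at once, assuming m3 was fit with as.numeric(as.character(price)) as its price term (as in the call above):

# willingness-to-pay (in dollars) for every non-price level: each coefficient
# divided by the negative price coefficient (prices were recorded in $1,000s)
price.name <- "as.numeric(as.character(price))"
wtp <- coef(m3)[names(coef(m3)) != price.name] / (-coef(m3)[price.name] / 1000)
round(wtp, 2)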

Choice Share Predictions (Optional)

The idea is to use the above model to make share predictions. A share simulator allows you to define a number of different alternatives and then use the model to predict how customers would choose among those new alternatives. For example, you could use the model to predict choice share for the company’s new minivan design against a set of key competitors. By varying the attributes of the planned minivan design, you can see how changes in the design affect the choice share.

In [32]:
predict.mnl <- function(model, data) {
  # Function for predicting shares from a multinomial logit model
  # model: mlogit object returned by mlogit()
  # data: a data frame containing the set of designs for which you want to
  #       predict shares. Same format as the data used to estimate model.
  data.model <- model.matrix(update(model$formula, 0 ~ .), data = data)[, -1]
  utility <- data.model %*% model$coef
  share <- exp(utility) / sum(exp(utility))
  cbind(share, data)
}

Now we need to create new data for prediction.

In [33]:
attrib <- list(seat = c("6", "7", "8"),
               cargo = c("2ft", "3ft"),
               eng = c("gas", "hyb", "elec"),
               price = c("30", "35", "40"))
In [34]:
new.data <- expand.grid(attrib)[c(8, 1, 3, 41, 49, 26), ]
In [35]:
predict.mnl(m3, new.data)
        share seat cargo  eng price
8  0.44278782    7   2ft  hyb    30
1  0.16377692    6   2ft  gas    30
3  0.12059018    8   2ft  gas    30
41 0.02731967    7   3ft  gas    40
49 0.05937323    6   2ft elec    40
26 0.18615216    7   2ft  hyb    35

The model-predicted shares are shown in the column labeled share. Among this set of products, we would expect respondents to choose the 7-seat hybrid-engine minivan with 2 ft of cargo space at $\$30K$ a little more than $44\%$ of the time. If a company were planning to launch a minivan like this, it could use the model to see how changing the attributes of this product would affect the choice shares. Note that these share predictions are always made relative to a particular set of competitors; the share for the first minivan would change if the competitive set were different.
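For example (a hypothetical what-if that reuses the new.data object defined above), we could give the first design 3 ft of cargo space and re-predict to see how the shares shift:

# hypothetical what-if: upgrade the first design's cargo space and re-predict shares
new.data.alt <- new.data
new.data.alt[1, "cargo"] <- "3ft"
predict.mnl(m3, new.data.alt)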

Things to Consider

  1. A crucial issue in planning a successful conjoint analysis study is deciding how many respondents should complete the survey. Sample size has a significant effect on the model estimates and share predictions.
  2. We have focused on the multinomial logit model, which estimates a single set of part-worth coefficients for the whole sample. Different people have different preferences, and models that estimate individual-level coefficients (heterogeneous, or mixed logit, models) can fit the data better and make more accurate predictions than sample-level models; a sketch of such a model is given below.
  3. While these share predictions are typically a good representation of how respondents would behave if they were asked to choose among these six minivans in a new survey, that predicted survey response might not translate directly to sales in the marketplace.
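As an illustration of point 2, here is a minimal sketch of a heterogeneous (random-coefficients, or mixed) logit model using mlogit's rpar and panel arguments, assuming normally distributed part worths for every coefficient in m1; the specification is illustrative rather than something estimated above:

# sketch: mixed (random-coefficients) logit with normally distributed part worths,
# treating each respondent's repeated choices as a panel
m1.rpar <- rep("n", length(coef(m1)))   # "n" = normal distribution for each part worth
names(m1.rpar) <- names(coef(m1))
m1.hier <- mlogit(choice ~ 0 + size + shp + flav + price, data = cbc.mlogit,
                  panel = TRUE, rpar = m1.rpar, correlation = FALSE)
summary(m1.hier)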