# Principal Component Analysis¶

In [2]:
setwd("C:\\Users\\armop\\Dropbox\\PHD\\Teaching\\AGBU505\\2019\\Unsupervised Learning\\Lecture 5") ##Start by defining directory

In [3]:
#install.packages("RColorBrewer")
library(corrplot)
library(gplots)
library(RColorBrewer)

Attaching package: 'gplots'

The following object is masked from 'package:stats':

lowess


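Note that brand.ratings is assumed to be in the workspace already (for instance, loaded in an earlier session). If it is not, a minimal sketch of loading it from a local CSV — the file name here is a placeholder, not the actual course data set:

```r
# Hypothetical load step; the file name is a placeholder.
# stringsAsFactors ensures the brand column is read as a factor.
brand.ratings <- read.csv("brand_ratings.csv", stringsAsFactors = TRUE)
```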
In [4]:
head(brand.ratings)

  perform leader latest fun serious bargain value trendy rebuy brand
1       2      4      8   8       2       9     7      4     6     a
2       1      1      4   7       1       1     1      2     2     a
3       2      3      5   9       2       9     5      1     6     a
4       1      6     10   8       3       4     5      2     1     a
5       1      1      5   8       1       9     9      1     1     a
6       2      8      9   5       3       8     7      1     2     a
In [5]:
tail(brand.ratings)

     perform leader latest fun serious bargain value trendy rebuy brand
995        4      2      8   7       1       3     3      5     2     j
996        2      2      3   6       4       8     5      1     2     j
997        3      2      6   7       1       3     3      2     1     j
998        1      1     10  10       1       6     5      5     2     j
999        1      1      7   5       1       1     2      5     1     j
1000       7      4      7   8       4       1     2      5     1     j

Each of the $100$ respondents has observations on each of the $10$ brands, so there are $1,000$ total rows.

We inspect the output of summary() and str() to check the data quality and structure:

In [6]:
summary(brand.ratings)

    perform           leader           latest            fun
Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000
1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 4.000   1st Qu.: 4.000
Median : 4.000   Median : 4.000   Median : 7.000   Median : 6.000
Mean   : 4.488   Mean   : 4.417   Mean   : 6.195   Mean   : 6.068
3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.: 9.000   3rd Qu.: 8.000
Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000

serious          bargain           value            trendy
Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00
1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 3.00
Median : 4.000   Median : 4.000   Median : 4.000   Median : 5.00
Mean   : 4.323   Mean   : 4.259   Mean   : 4.337   Mean   : 5.22
3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 7.00
Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00

     rebuy             brand
 Min.   : 1.000   a      :100
 1st Qu.: 1.000   b      :100
 Median : 3.000   c      :100
 Mean   : 3.727   d      :100
 3rd Qu.: 5.000   e      :100
 Max.   :10.000   f      :100
                  (Other):400
In [7]:
str(brand.ratings)

'data.frame':	1000 obs. of  10 variables:
 $ perform: int  2 1 2 1 1 2 1 2 2 3 ...
 $ leader : int  4 1 3 6 1 8 1 1 1 1 ...
 $ latest : int  8 4 5 10 5 9 5 7 8 9 ...
 $ fun    : int  8 7 9 8 8 5 7 5 10 8 ...
 $ serious: int  2 1 2 3 1 3 1 2 1 1 ...
 $ bargain: int  9 1 9 4 9 8 5 8 7 3 ...
 $ value  : int  7 1 5 5 9 7 1 7 7 3 ...
 $ trendy : int  4 2 1 2 1 1 1 7 5 4 ...
 $ rebuy  : int  6 2 6 1 1 2 1 1 1 1 ...
 $ brand  : Factor w/ 10 levels "a","b","c","d",..: 1 1 1 1 1 1 1 1 1 1 ...


We see in summary() that the ratings for each adjective range from $1$ to $10$. In str(), we see that the ratings were read as integers while the brand labels were properly interpreted as a factor. In short, the data appear to be clean and formatted appropriately.

### Rescaling the Data¶

It is often good practice to rescale raw data. This makes data more comparable across individuals and samples. A common procedure is to center each variable by subtracting its mean from every observation, and then rescale those centered values as units of standard deviation. This is commonly called standardizing or normalizing.
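As a quick sketch of what standardizing does, scale() is equivalent to subtracting each column's mean and then dividing by its standard deviation (a toy vector here, not the brand data):

```r
x <- c(2, 4, 6, 8)                     # toy ratings vector
x.sc <- (x - mean(x)) / sd(x)          # center, then rescale to sd units
all.equal(as.numeric(scale(x)), x.sc)  # TRUE: scale() does the same thing
```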

In [8]:
brand.sc <- brand.ratings

In [9]:
brand.sc[, 1:9] <- scale(brand.ratings[, 1:9])

In [10]:
summary(brand.sc)

    perform            leader            latest             fun
Min.   :-1.0888   Min.   :-1.3100   Min.   :-1.6878   Min.   :-1.84677
1st Qu.:-1.0888   1st Qu.:-0.9266   1st Qu.:-0.7131   1st Qu.:-0.75358
Median :-0.1523   Median :-0.1599   Median : 0.2615   Median :-0.02478
Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000
3rd Qu.: 0.7842   3rd Qu.: 0.6069   3rd Qu.: 0.9113   3rd Qu.: 0.70402
Max.   : 1.7206   Max.   : 2.1404   Max.   : 1.2362   Max.   : 1.43281

serious           bargain             value             trendy
Min.   :-1.1961   Min.   :-1.22196   Min.   :-1.3912   Min.   :-1.53897
1st Qu.:-0.8362   1st Qu.:-0.84701   1st Qu.:-0.9743   1st Qu.:-0.80960
Median :-0.1163   Median :-0.09711   Median :-0.1405   Median :-0.08023
Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000
3rd Qu.: 0.6036   3rd Qu.: 0.65279   3rd Qu.: 0.6933   3rd Qu.: 0.64914
Max.   : 2.0434   Max.   : 2.15258   Max.   : 2.3610   Max.   : 1.74319

     rebuy              brand
 Min.   :-1.0717   a      :100
 1st Qu.:-1.0717   b      :100
 Median :-0.2857   c      :100
 Mean   : 0.0000   d      :100
 3rd Qu.: 0.5003   e      :100
 Max.   : 2.4652   f      :100
                   (Other):400

In this code we name the new data frame with extension “.sc” to remind ourselves that observations have been scaled. We operate on columns 1–9 because the 10th column is a factor variable for brand.

Note: Observations on the adjectives have a spread (difference between min and max) of roughly 3 standard deviation units. This means the distributions are platykurtic, flatter than a standard normal distribution, because we would expect a range of more than 4 standard deviation units for a sample of this size.
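We can verify this spread directly on the scaled data by computing the range of each adjective in standard-deviation units:

```r
# Range (max - min) of each scaled adjective, in sd units
apply(brand.sc[, 1:9], 2, function(x) diff(range(x)))
```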

In [11]:
corrplot(cor(brand.sc[, 1:9]), order="hclust")


In corrplot(), the argument order="hclust" reorders the rows and columns according to the variables’ similarity in a hierarchical cluster solution. We will cover clustering in the next class.

This visualization of the basic data appears to show three general clusters: fun/latest/trendy, rebuy/bargain/value, and perform/leader/serious.

Perhaps the simplest business question that you may have is:

What is the average (mean) position of the brand on each feature?

We can use aggregate() to find the mean of each variable by brand:

In [12]:
brand.mean <- aggregate(. ~ brand, data=brand.sc, mean)
brand.mean

   brand     perform     leader     latest        fun     serious     bargain       value      trendy       rebuy
1      a -0.88591874 -0.5279035  0.4109732  0.6566458 -0.91894067  0.21409609  0.18469264 -0.52514473 -0.59616642
2      b  0.93087022  1.0707584  0.7261069 -0.9722147  1.18314061  0.04161938  0.15133957  0.74030819  0.23697320
3      c  0.64992347  1.1627677 -0.1023372 -0.8446753  1.22273461 -0.60704302 -0.44067747  0.02552787 -0.13243776
4      d -0.67989112 -0.5930767  0.3524948  0.1865719 -0.69217505 -0.88075605 -0.93263529  0.73666135 -0.49398892
5      e -0.56439079  0.1928362  0.4564564  0.2958914  0.04211361  0.55155051  0.41816415  0.13857986  0.03654811
6      f -0.05868665  0.2695106 -1.2621589 -0.2179102  0.58923066  0.87400696  1.02268859 -0.81324496  1.35699580
7      g  0.91838369 -0.1675336 -1.2849005 -0.5167168 -0.53379906  0.89650392  1.25616009 -1.27639344  1.36092571
8      h -0.01498383 -0.2978802  0.5019396  0.7149495 -0.14145855 -0.73827529 -0.78254646  0.86430070 -0.60402622
9      i  0.33463879 -0.3208825  0.3557436  0.4124989 -0.14865746 -0.25459062 -0.80339213  0.59078782 -0.20317603
10     j -0.62994504 -0.7885965 -0.1543180  0.2849595 -0.60218870 -0.09711188 -0.07379367 -0.48138267 -0.96164748

Before proceeding, we perform a bit of housekeeping on the new brand.mean object.

We name the rows with the brand labels that aggregate() put into the brand column, and then we remove that column as redundant:

In [13]:
rownames(brand.mean) <- brand.mean[, 1] # use brand for the row names
brand.mean <- brand.mean[, -1] # remove brand name column

In [14]:
brand.mean

      perform     leader     latest        fun     serious     bargain       value      trendy       rebuy
a -0.88591874 -0.5279035  0.4109732  0.6566458 -0.91894067  0.21409609  0.18469264 -0.52514473 -0.59616642
b  0.93087022  1.0707584  0.7261069 -0.9722147  1.18314061  0.04161938  0.15133957  0.74030819  0.23697320
c  0.64992347  1.1627677 -0.1023372 -0.8446753  1.22273461 -0.60704302 -0.44067747  0.02552787 -0.13243776
d -0.67989112 -0.5930767  0.3524948  0.1865719 -0.69217505 -0.88075605 -0.93263529  0.73666135 -0.49398892
e -0.56439079  0.1928362  0.4564564  0.2958914  0.04211361  0.55155051  0.41816415  0.13857986  0.03654811
f -0.05868665  0.2695106 -1.2621589 -0.2179102  0.58923066  0.87400696  1.02268859 -0.81324496  1.35699580
g  0.91838369 -0.1675336 -1.2849005 -0.5167168 -0.53379906  0.89650392  1.25616009 -1.27639344  1.36092571
h -0.01498383 -0.2978802  0.5019396  0.7149495 -0.14145855 -0.73827529 -0.78254646  0.86430070 -0.60402622
i  0.33463879 -0.3208825  0.3557436  0.4124989 -0.14865746 -0.25459062 -0.80339213  0.59078782 -0.20317603
j -0.62994504 -0.7885965 -0.1543180  0.2849595 -0.60218870 -0.09711188 -0.07379367 -0.48138267 -0.96164748

A heatmap is a useful way to examine such results because it colors data points by the intensities of their values.

We use heatmap.2() from the gplots package, with colors from the RColorBrewer package:

In [15]:
heatmap.2(as.matrix(brand.mean),
col=brewer.pal(9, "GnBu"), trace="none", key=FALSE, dend="none",
main="\nBrand attributes")


In the code above, we coerce brand.mean to a matrix, as heatmap.2() expects. We color the map using greens and blues from RColorBrewer’s “GnBu” palette and turn off a few options that otherwise clutter the heatmap (trace, key, and dendrogram). We improve title alignment by adding blank lines with \n before the title text.

In this chart’s green-to-blue ("GnBu") palette a green color indicates a low value and dark blue indicates a high value; lighter colors are for values in the middle of the range.

The brands are clearly perceived differently with some brands rated high on performance and leadership (brands b and c) and others rated high for value and intention to rebuy (brands f and g).

By default, heatmap.2() sorts the columns and rows in order to emphasize similarities and patterns in the data, which is why the rows and columns in the figure are ordered in an unexpected way. It does this using a form of hierarchical clustering.
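If you prefer to keep the original row and column order, heatmap.2() lets you turn that reordering off via its Rowv and Colv arguments — a sketch:

```r
# Disable heatmap.2()'s hierarchical reordering of rows and columns
heatmap.2(as.matrix(brand.mean),
          col=brewer.pal(9, "GnBu"), trace="none", key=FALSE,
          Rowv=FALSE, Colv=FALSE, dend="none",
          main="\nBrand attributes")
```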

Looking at two figures above we could guess at the groupings and relationships of adjectives and brands.

For example, there is similarity in the color pattern across columns for the bargain/value/rebuy; a brand that is high on one tends to be high on another.

But it is better to formalize such insight, and the remainder of this chapter discusses how to do so.

### PCA for Brand Ratings¶

Let’s look at the principal components for the brand rating data. We find the components with prcomp(), selecting just the rating columns 1–9:

In [16]:
brand.pc <- prcomp(brand.sc[, 1:9])

In [17]:
summary(brand.pc)

Importance of components:
PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.726 1.4479 1.0389 0.8528 0.79846 0.73133 0.62458
Proportion of Variance 0.331 0.2329 0.1199 0.0808 0.07084 0.05943 0.04334
Cumulative Proportion  0.331 0.5640 0.6839 0.7647 0.83554 0.89497 0.93831
PC8     PC9
Standard deviation     0.55861 0.49310
Proportion of Variance 0.03467 0.02702
Cumulative Proportion  0.97298 1.00000

The default plot() for a PCA is a scree plot, which shows the successive proportion of additional variance that each component adds.

In [18]:
plot(brand.pc, type="l")


A scree plot is often interpreted as indicating where additional components are not worth the complexity; this occurs where the line has an elbow (a kink in its angle of bending), a somewhat subjective determination. In the figure, the elbow occurs at either component three or four, depending on interpretation; this suggests that the first two or three components explain most of the variation in the observed brand ratings.
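The cumulative proportions reported by summary() can also be computed directly from the component standard deviations that prcomp() returns:

```r
# Proportion of variance per component, from prcomp()'s sdev slot
pc.var <- brand.pc$sdev^2 / sum(brand.pc$sdev^2)
round(cumsum(pc.var), 3)   # cumulative proportions, cf. summary(brand.pc)
```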

A biplot() of the first two principal components reveals how the rating adjectives are associated:

In [19]:
biplot(brand.pc)


We see the result in the figure, where adjectives map in four regions:

1. category leadership (“serious,” “leader,” and “perform”),

2. value (“bargain,” “value,” and “rebuy”),

3. trendiness (“trendy” and “latest”),

4. finally “fun” on its own.

There is a problem: the plot of individual respondents’ ratings is too dense and it does not tell us about the brand positions! A better solution is to perform PCA using aggregated ratings by brand.

In [20]:
brand.mean

      perform     leader     latest        fun     serious     bargain       value      trendy       rebuy
a -0.88591874 -0.5279035  0.4109732  0.6566458 -0.91894067  0.21409609  0.18469264 -0.52514473 -0.59616642
b  0.93087022  1.0707584  0.7261069 -0.9722147  1.18314061  0.04161938  0.15133957  0.74030819  0.23697320
c  0.64992347  1.1627677 -0.1023372 -0.8446753  1.22273461 -0.60704302 -0.44067747  0.02552787 -0.13243776
d -0.67989112 -0.5930767  0.3524948  0.1865719 -0.69217505 -0.88075605 -0.93263529  0.73666135 -0.49398892
e -0.56439079  0.1928362  0.4564564  0.2958914  0.04211361  0.55155051  0.41816415  0.13857986  0.03654811
f -0.05868665  0.2695106 -1.2621589 -0.2179102  0.58923066  0.87400696  1.02268859 -0.81324496  1.35699580
g  0.91838369 -0.1675336 -1.2849005 -0.5167168 -0.53379906  0.89650392  1.25616009 -1.27639344  1.36092571
h -0.01498383 -0.2978802  0.5019396  0.7149495 -0.14145855 -0.73827529 -0.78254646  0.86430070 -0.60402622
i  0.33463879 -0.3208825  0.3557436  0.4124989 -0.14865746 -0.25459062 -0.80339213  0.59078782 -0.20317603
j -0.62994504 -0.7885965 -0.1543180  0.2849595 -0.60218870 -0.09711188 -0.07379367 -0.48138267 -0.96164748
In [21]:
brand.mu.pc <- prcomp(brand.mean, scale=TRUE)

In [22]:
summary(brand.mu.pc)

Importance of components:
PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     2.1345 1.7349 0.7690 0.61498 0.50983 0.36662 0.21506
Proportion of Variance 0.5062 0.3345 0.0657 0.04202 0.02888 0.01493 0.00514
Cumulative Proportion  0.5062 0.8407 0.9064 0.94842 0.97730 0.99223 0.99737
PC8     PC9
Standard deviation     0.14588 0.04867
Proportion of Variance 0.00236 0.00026
Cumulative Proportion  0.99974 1.00000

In the call to prcomp(), we added scale=TRUE in order to rescale the data; even though the raw data were already rescaled, the aggregated means have a somewhat different scale than the standardized data themselves.

The results show that the first two components account for 84% of the explainable variance in the mean ratings, so we focus on interpreting results with regard to them.

In [23]:
plot(brand.mu.pc,type="l")


### Perceptual Map¶

A biplot of the PCA solution for the mean ratings gives an interpretable perceptual map, showing where the brands are placed with respect to the first two principal components.

In [24]:
biplot(brand.mu.pc, main="Brand positioning", cex=c(1.5, 1))


Important: Before interpreting the new map, we first check that using mean data did not greatly alter the structure. The figure above shows a different spatial rotation of the adjectives compared to the prior figure, but the spatial position is arbitrary, and the new map has the same overall grouping of adjectives and relational structure (for instance, “serious” and “leader” are closely related while “fun” is rather distant from the other adjectives). Thus the variable positions on the components are consistent with PCA on the full set of observations, and we go ahead and interpret the graphic.
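One way to make this check concrete is to compare the variable loadings on the first two components across the two solutions (signs may flip between runs, but the grouping of adjectives should match):

```r
# Loadings (rotation matrix) on the first two components in each solution
brand.pc$rotation[, 1:2]     # PCA on the full 1,000 observations
brand.mu.pc$rotation[, 1:2]  # PCA on the 10 brand means
```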

### Interpretation¶

What does the map tell us? First we interpret the adjective clusters and relationships and see four areas with well differentiated sets of adjectives and brands that are positioned in proximity.

Brands f and g are high on “value,” for instance, while a and j are relatively high on “fun,” which is opposite in direction from leadership adjectives (“leader” and “serious”).

With such a map, one might form questions and then refer to the underlying data to answer them. For instance, suppose that you are the brand manager for brand e.

What does the map tell you? For one thing, your brand is in the center and thus appears not to be well-differentiated on any of the dimensions. That could be good or bad, depending on your strategic goals. If your goal is to be a safe brand that appeals to many consumers, then a relatively undifferentiated position like e could be desirable. On the other hand, if you wish your brand to have a strong, differentiated perception, this finding would be unwanted (but important to know).

What should you do about the position of your brand e? Again, it depends on the strategic goals. If you wish to increase differentiation, one possibility would be to take action to shift your brand in some direction on the map. Suppose you wanted to move in the direction of brand c. You could look at the specific differences from c in the data:

In [25]:
brand.mean["c", ] - brand.mean["e", ]

   perform    leader     latest       fun  serious   bargain      value    trendy      rebuy
c 1.214314 0.9699315 -0.5587936 -1.140567 1.180621 -1.158594 -0.8588416 -0.113052 -0.1689859

This shows you that e is relatively stronger than c on “value” and “fun”, which suggests dialing down messaging or other attributes that reinforce those (assuming, of course, that you truly want to move in the direction of c). Similarly, c is stronger on “perform” and “serious,” so those could be aspects of the product or message for e to strengthen.

Another option would be not to follow another brand but to aim for differentiated space where no brand is positioned. In Figure above, there is a large gap between the group b and c on the bottom of the chart, versus f and g on the upper right. This area might be described as the “value leader” area or similar.

How do we find out how to position there? Let’s assume that the gap reflects approximately the average of those four brands. We can find that average using colMeans() on those brands’ rows, and then take the difference of e from that average:

In [26]:
colMeans(brand.mean[c("b", "c", "f", "g"), ]) - brand.mean["e", ]

   perform    leader     latest        fun   serious    bargain      value     trendy     rebuy
e 1.174513 0.3910396 -0.9372789 -0.9337707 0.5732131 -0.2502787 0.07921355 -0.4695304 0.6690661

This suggests that brand e could target the gap by increasing its emphasis on performance while reducing emphasis on “latest” and “fun.”

### Summary¶

To summarize, when you wish to compare several brands across many dimensions, it can be helpful to focus on just the first two or three principal components that explain variation in the data. You can select how many components to focus on using a scree plot, which shows how much variation in the data is explained by each principal component. A perceptual map plots the brands on the first two principal components, revealing how the observations relate to the underlying dimensions (the components).

### Cautions with Perceptual Maps¶

First, you must choose the level and type of aggregation carefully.We demonstrated the maps using mean rating by brand, but depending on the data and question at hand, it might be more suitable to use median (for ordinal data) or even modal response (for categorical data). You should check that the dimensions are similar for the full data and aggregated data before interpreting aggregate maps. You can do this by examining the variable positions and relationships in biplots of both aggregated data (such as means) and raw data (or a random subset of it), as we did above.

Second, the relationships are strictly relative to the product category and the brands and adjectives that are tested. In a different product category, or with different brands, adjectives such as “fun” and “leader” could have a very different relationship. Sometimes simply adding or dropping a brand can change the resulting map significantly because the positions are relative. In other words, if a new brand enters the market (or one’s analysis), the other positions may change substantially. One must also be confident that all of the key perceptions (adjectives, in this example) have been assessed. One way to assess sensitivity here is to run PCA and biplot on a few different samples from your data, such as 80% of your observations, perhaps dropping an adjective each time. If the maps are similar across those samples, you may feel more confident in their stability.
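A minimal sketch of such a sensitivity check, resampling 80% of respondents’ rows (the seed is arbitrary):

```r
set.seed(123)                                     # arbitrary seed, for reproducibility
rows <- sample(nrow(brand.sc), 0.8 * nrow(brand.sc))
sub.mean <- aggregate(. ~ brand, data=brand.sc[rows, ], mean)
rownames(sub.mean) <- sub.mean[, 1]               # same housekeeping as before
sub.mean <- sub.mean[, -1]
biplot(prcomp(sub.mean, scale=TRUE))              # compare to the full-data map
```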

Third, it is frequently misunderstood that the positions of brands in such a map depend on their relative positioning in terms of the principal components, which are constructed composites of all dimensions. This means that the strength of a brand on a single adjective cannot be read directly from the chart. For instance, in the figure above, it might appear that brands b and c are weaker than d, h, and i on “latest” but are similar to one another. In fact, b is the single strongest brand on “latest” while c is weak on that adjective. Overall, b and c are quite similar to one another in terms of their scores on the two components that aggregate all of the variables (adjectives), but they are not necessarily similar on any single variable. Another way to look at this is that when we use PCA to focus on the first one or two dimensions in the data, we are looking at the largest-magnitude similarities, which may obscure smaller differences that do not show up strongly in the first one or two dimensions.
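This is easy to check against the underlying data; for example, ranking the brands on “latest” alone:

```r
# Mean scaled "latest" rating by brand, sorted; b is the single strongest
# brand on this adjective even though the map places b and c close together
brand.mean[order(brand.mean$latest, decreasing=TRUE), "latest", drop=FALSE]
```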