In [61]:
rm(list=ls())

Predicting the Sales of Carseats

We first use classification trees to analyze the Carseats data set. The main goal is to predict the sales of car seats and to identify the features that most influence sales.

In [62]:
library(ISLR)
library(plotly)

Data

A data frame with 400 observations on the following 11 variables.

  • Sales - Unit sales (in thousands) at each location
  • CompPrice - Price charged by competitor at each location
  • Income - Community income level (in thousands of dollars)
  • Advertising - Local advertising budget for the company at each location (in thousands of dollars)
  • Population - Population size in region (in thousands)
  • Price - Price the company charges for car seats at each site
  • ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
  • Age - Average age of the local population
  • Education - Education level at each location
  • Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location
  • US - A factor with levels No and Yes to indicate whether the store is in the US or not
In [63]:
head(Carseats)
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US
1  9.50       138     73          11        276   120       Bad  42        17   Yes Yes
2 11.22       111     48          16        260    83      Good  65        10   Yes Yes
3 10.06       113     35          10        269    80    Medium  59        12   Yes Yes
4  7.40       117    100           4        466    97    Medium  55        14   Yes Yes
5  4.15       141     64           3        340   128       Bad  38        13   Yes  No
6 10.81       124    113          13        501    72       Bad  78        16    No Yes
In [64]:
sum(is.na(Carseats))
0
In [65]:
summary(Carseats)
     Sales          CompPrice       Income        Advertising    
 Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
 1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
 Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
 Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
 3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
 Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
   Population        Price        ShelveLoc        Age          Education   
 Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
 1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
 Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
 Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
 3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
 Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
 Urban       US     
 No :118   No :142  
 Yes:282   Yes:258  
                    
                    
                    
                    

In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable. We use the ifelse() function to create a variable, called High, which takes on a value of 1 if the Sales variable exceeds 8, and a value of 0 otherwise.

In [66]:
Carseats$Income
  1. 73
  2. 48
  3. 35
  ...
  400. 37
(output truncated; 400 values)
In [67]:
attach(Carseats)
The following objects are masked from Carseats (pos = 3):

    Advertising, Age, CompPrice, Education, Income, Population, Price,
    Sales, ShelveLoc, Urban, US

The following objects are masked from Carseats (pos = 5):

    Advertising, Age, CompPrice, Education, Income, Population, Price,
    Sales, ShelveLoc, Urban, US

In [68]:
Income
  1. 73
  2. 48
  3. 35
  ...
  400. 37
(output truncated; identical to the Carseats$Income listing above)
In [74]:
# Store High as a factor so the downstream tree() and randomForest() fits are classification models
Carseats$High <- factor(ifelse(Sales <= 8, 0, 1))
In [75]:
factor(High)
  1. 1
  2. 1
  3. 1
  ...
  400. 1
(output truncated; 400 values)

The assignment above stores High as a new column of the Carseats data frame, as head() confirms.

In [76]:
head(Carseats)
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US High
1  9.50       138     73          11        276   120       Bad  42        17   Yes Yes    1
2 11.22       111     48          16        260    83      Good  65        10   Yes Yes    1
3 10.06       113     35          10        269    80    Medium  59        12   Yes Yes    1
4  7.40       117    100           4        466    97    Medium  55        14   Yes Yes    0
5  4.15       141     64           3        340   128       Bad  38        13   Yes  No    0
6 10.81       124    113          13        501    72       Bad  78        16    No Yes    1
In [77]:
# Color points by High: the factor levels 0/1 index red/blue
plot(Carseats$Price, Carseats$Advertising, col = c("red","blue")[Carseats$High])
In [78]:
plot(Advertising,ShelveLoc,col=c("red","blue")[High])
In [79]:
plot_ly(Carseats, x = ~Price, y = ~Advertising, z = ~ShelveLoc, color = ~High, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = ''),
                     yaxis = list(title = ''),
                     zaxis = list(title = '')))
ERROR while rich displaying an object: Error in Summary.factor(...): 'range' not meaningful for factors
[traceback truncated]
HTML widgets cannot be represented in plain text (need html)

The widget fails because plotly attempts to compute a numeric range for the factor ShelveLoc, which is not defined for factors.
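One way to get the 3-D scatter to render is to give plotly encodings it can compute ranges for: a numeric recoding of ShelveLoc on the z axis and High as a factor for the color scale. A minimal sketch under those assumptions, not re-run against this plotly version:

plot_ly(Carseats, x = ~Price, y = ~Advertising, z = ~as.numeric(ShelveLoc),
        color = ~factor(High), colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%  # as.numeric(ShelveLoc): 1 = Bad, 2 = Good, 3 = Medium
  layout(scene = list(zaxis = list(title = 'ShelveLoc (coded)')))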

Logistic regression

In [80]:
l <- glm(High~. -Sales, data = Carseats, family=binomial)
In [81]:
summary(l)
Call:
glm(formula = High ~ . - Sales, family = binomial, data = Carseats)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.40315  -0.28543  -0.04319   0.18482   2.98849  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -5.0889857  2.4067745  -2.114   0.0345 *  
CompPrice        0.1724750  0.0233456   7.388 1.49e-13 ***
Income           0.0341248  0.0079002   4.319 1.56e-05 ***
Advertising      0.2957492  0.0512506   5.771 7.90e-09 ***
Population      -0.0009173  0.0014190  -0.646   0.5180    
Price           -0.1674933  0.0196266  -8.534  < 2e-16 ***
ShelveLocGood    8.2408876  1.0328293   7.979 1.48e-15 ***
ShelveLocMedium  3.5970462  0.6520385   5.517 3.46e-08 ***
Age             -0.0793793  0.0142195  -5.582 2.37e-08 ***
Education       -0.0576359  0.0763170  -0.755   0.4501    
UrbanYes        -0.3827861  0.4392739  -0.871   0.3835    
USYes           -0.6986095  0.5949922  -1.174   0.2403    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 541.49  on 399  degrees of freedom
Residual deviance: 181.48  on 388  degrees of freedom
AIC: 205.48

Number of Fisher Scoring iterations: 7
In [82]:
exp(0.295)
1.34312635868628
In [83]:
1-exp(-0.383)
0.318187100714009

This fitted model says that, holding the other predictors fixed, a one-unit increase in Advertising increases the odds of High sales by $34\%$, since $\exp(0.295) = 1.34$. Likewise, holding the rest fixed, the odds of High sales for stores in urban areas (UrbanYes = 1) relative to non-urban stores are $\exp(-0.383) = 0.682$; in terms of percent change, the odds for urban stores are about $32\%$ lower than for non-urban stores.
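The same arithmetic can be read off for every coefficient at once. A minimal sketch, assuming the fitted model object l from above:

# Odds ratios and percent changes implied by the logistic coefficients
or <- exp(coef(l))
round(cbind(odds.ratio = or, pct.change = 100 * (or - 1)), 2)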

Prediction

In [84]:
set.seed(33)
#data<-read.csv("default1.csv")
test.index <- sample(1:nrow(Carseats), size = 150, replace = F)
test <- Carseats[test.index,]
train <- Carseats[-test.index,]
In [85]:
l <- glm(High~. -Sales, data = train, family=binomial)

In Sample Prediction

In [86]:
l.pred <- predict(l, train, type = "response")
l.pred1 <- as.numeric(l.pred > 0.5)
In [87]:
l.table<-table(l.pred1 ,train$High )
l.table
       
l.pred1   0   1
      0 139  15
      1  14  82
In [89]:
sum(l.pred1)
96
In [179]:
(139+82)/250
0.884
In [180]:
# Rates from a 2x2 confusion table x with rows = predicted class (0, 1)
# and columns = actual class (0, 1)
summary.pred <- function(x)
    {
        FPR <- x[1,2]/sum(x[1,])   # actual 1s among predicted 0s
        TPR <- x[2,2]/sum(x[2,])   # actual 1s among predicted 1s (precision)
        PPR <- x[2,2]/sum(x[,2])   # predicted 1s among actual 1s (recall)
        NPR <- x[1,1]/sum(x[,1])   # predicted 0s among actual 0s (specificity)

        output <- list("FPR"=round(FPR,2), "TPR"=round(TPR,2), "PPR"=round(PPR,2), "NPR"=round(NPR,2))
    return(output)
    }
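Note that the names above do not match the textbook definitions: as the comments indicate, TPR here is really precision and PPR is really recall. For reference, a sketch of the standard rates under their usual names (conf.rates is a helper introduced here, using the same rows = predicted, columns = actual convention):

conf.rates <- function(x) {
    list(accuracy    = round(sum(diag(x)) / sum(x), 2),  # correct / total
         sensitivity = round(x[2,2] / sum(x[,2]), 2),    # actual 1s predicted 1
         specificity = round(x[1,1] / sum(x[,1]), 2),    # actual 0s predicted 0
         precision   = round(x[2,2] / sum(x[2,]), 2))    # predicted 1s that are 1
}
conf.rates(l.table)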
In [181]:
l.pred.summary<-summary.pred(l.table)
l.pred.summary
$FPR
0.1
$TPR
0.85
$PPR
0.85
$NPR
0.91

Out of Sample Prediction

In [182]:
l.pred <- predict(l, test, type = "response")
l.pred1 <- as.numeric(l.pred > 0.5)
In [183]:
l.table.test<-table(l.pred1 ,test$High )
l.table.test
       
l.pred1  0  1
      0 80 11
      1  3 56
In [185]:
(80+56)/150
0.906666666666667
In [186]:
l.test.pred.summary<-summary.pred(l.table.test)
l.test.pred.summary
$FPR
0.12
$TPR
0.95
$PPR
0.84
$NPR
0.96

Tree Based Models

We now use the tree() function to fit a classification tree in order to predict High using all variables but Sales. The syntax of the tree() function is quite similar to that of the lm() function.

In [208]:
library(tree)
In [209]:
tree.carseats <- tree(High ~ . - Sales, Carseats)
In [210]:
summary(tree.carseats)
Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population" 
[6] "Advertising" "Age"         "US"         
Number of terminal nodes:  27 
Residual mean deviance:  0.4575 = 170.7 / 373 
Misclassification error rate: 0.09 = 36 / 400 

We see that the training error rate is 9%. For classification trees, the deviance reported in the output of summary() is given by $$-2 \sum_m \sum_k n_{mk}\log \hat p_{mk},$$ where $n_{mk}$ is the number of observations in the $m$th terminal node that belong to the $k$th class. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by $n - |T_0|$, which in this case is $400 - 27 = 373$. One of the most attractive properties of trees is that they can be graphically displayed. We use the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty=0 instructs R to include the category names for any qualitative predictors, rather than simply displaying a letter for each category.
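As a quick sanity check, the residual mean deviance can be recomputed by hand; a sketch assuming the deviance() method that the tree package provides for tree objects:

# Total deviance divided by n - |T0|; should match summary()'s 0.4575 up to rounding
deviance(tree.carseats) / (400 - 27)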

In [211]:
plot(tree.carseats)
text(tree.carseats, pretty = 0)

The most important indicator of Sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations. If we just type the name of the tree object, R prints output corresponding to each branch of the tree. R displays the split criterion (e.g. $Price<92.5$), the number of observations in that branch, the deviance, the overall prediction for the branch (Yes or No), and the fraction of observations in that branch that take on values of Yes and No. Branches that lead to terminal nodes are indicated using asterisks.

In [212]:
tree.carseats
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 400 541.500 0 ( 0.59000 0.41000 )  
    2) ShelveLoc: Bad,Medium 315 390.600 0 ( 0.68889 0.31111 )  
      4) Price < 92.5 46  56.530 1 ( 0.30435 0.69565 )  
        8) Income < 57 10  12.220 0 ( 0.70000 0.30000 )  
         16) CompPrice < 110.5 5   0.000 0 ( 1.00000 0.00000 ) *
         17) CompPrice > 110.5 5   6.730 1 ( 0.40000 0.60000 ) *
        9) Income > 57 36  35.470 1 ( 0.19444 0.80556 )  
         18) Population < 207.5 16  21.170 1 ( 0.37500 0.62500 ) *
         19) Population > 207.5 20   7.941 1 ( 0.05000 0.95000 ) *
      5) Price > 92.5 269 299.800 0 ( 0.75465 0.24535 )  
       10) Advertising < 13.5 224 213.200 0 ( 0.81696 0.18304 )  
         20) CompPrice < 124.5 96  44.890 0 ( 0.93750 0.06250 )  
           40) Price < 106.5 38  33.150 0 ( 0.84211 0.15789 )  
             80) Population < 177 12  16.300 0 ( 0.58333 0.41667 )  
              160) Income < 60.5 6   0.000 0 ( 1.00000 0.00000 ) *
              161) Income > 60.5 6   5.407 1 ( 0.16667 0.83333 ) *
             81) Population > 177 26   8.477 0 ( 0.96154 0.03846 ) *
           41) Price > 106.5 58   0.000 0 ( 1.00000 0.00000 ) *
         21) CompPrice > 124.5 128 150.200 0 ( 0.72656 0.27344 )  
           42) Price < 122.5 51  70.680 1 ( 0.49020 0.50980 )  
             84) ShelveLoc: Bad 11   6.702 0 ( 0.90909 0.09091 ) *
             85) ShelveLoc: Medium 40  52.930 1 ( 0.37500 0.62500 )  
              170) Price < 109.5 16   7.481 1 ( 0.06250 0.93750 ) *
              171) Price > 109.5 24  32.600 0 ( 0.58333 0.41667 )  
                342) Age < 49.5 13  16.050 1 ( 0.30769 0.69231 ) *
                343) Age > 49.5 11   6.702 0 ( 0.90909 0.09091 ) *
           43) Price > 122.5 77  55.540 0 ( 0.88312 0.11688 )  
             86) CompPrice < 147.5 58  17.400 0 ( 0.96552 0.03448 ) *
             87) CompPrice > 147.5 19  25.010 0 ( 0.63158 0.36842 )  
              174) Price < 147 12  16.300 1 ( 0.41667 0.58333 )  
                348) CompPrice < 152.5 7   5.742 1 ( 0.14286 0.85714 ) *
                349) CompPrice > 152.5 5   5.004 0 ( 0.80000 0.20000 ) *
              175) Price > 147 7   0.000 0 ( 1.00000 0.00000 ) *
       11) Advertising > 13.5 45  61.830 1 ( 0.44444 0.55556 )  
         22) Age < 54.5 25  25.020 1 ( 0.20000 0.80000 )  
           44) CompPrice < 130.5 14  18.250 1 ( 0.35714 0.64286 )  
             88) Income < 100 9  12.370 0 ( 0.55556 0.44444 ) *
             89) Income > 100 5   0.000 1 ( 0.00000 1.00000 ) *
           45) CompPrice > 130.5 11   0.000 1 ( 0.00000 1.00000 ) *
         23) Age > 54.5 20  22.490 0 ( 0.75000 0.25000 )  
           46) CompPrice < 122.5 10   0.000 0 ( 1.00000 0.00000 ) *
           47) CompPrice > 122.5 10  13.860 0 ( 0.50000 0.50000 )  
             94) Price < 125 5   0.000 1 ( 0.00000 1.00000 ) *
             95) Price > 125 5   0.000 0 ( 1.00000 0.00000 ) *
    3) ShelveLoc: Good 85  90.330 1 ( 0.22353 0.77647 )  
      6) Price < 135 68  49.260 1 ( 0.11765 0.88235 )  
       12) US: No 17  22.070 1 ( 0.35294 0.64706 )  
         24) Price < 109 8   0.000 1 ( 0.00000 1.00000 ) *
         25) Price > 109 9  11.460 0 ( 0.66667 0.33333 ) *
       13) US: Yes 51  16.880 1 ( 0.03922 0.96078 ) *
      7) Price > 135 17  22.070 0 ( 0.64706 0.35294 )  
       14) Income < 46 6   0.000 0 ( 1.00000 0.00000 ) *
       15) Income > 46 11  15.160 1 ( 0.45455 0.54545 ) *

In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. The predict() function can be used for this purpose. In the case of a classification tree, the argument type="class" instructs R to return the actual class prediction. This approach leads to correct predictions for around $74\%$ of the locations in the test data set.

In [213]:
tree.carseats <- tree(High ~ . - Sales, train)
tree.pred <- predict(tree.carseats, test, type = "class")
t.table <- table(tree.pred, test$High)
t.table
         
tree.pred  0  1
        0 64 20
        1 19 47
In [214]:
(64+47)/150
0.74
In [215]:
t.test.pred.summary<-summary.pred(t.table)
t.test.pred.summary
$FPR
0.24
$TPR
0.71
$PPR
0.7
$NPR
0.77

Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration. We use the argument FUN=prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance. The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to $\alpha$ in the lecture notes).

In [223]:
set.seed(3)
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
names(cv.carseats)
cv.carseats
  1. 'size'
  2. 'dev'
  3. 'k'
  4. 'method'
$size
[1] 21 19 13  9  5  2  1

$dev
[1] 71 71 66 58 55 71 97

$k
[1]      -Inf  0.000000  1.000000  1.500000  2.500000  8.333333 26.000000

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"

Note that, despite the name, dev corresponds to the cross-validation error rate in this instance. The tree with $5$ terminal nodes results in the lowest cross-validation error rate, with $55$ cross-validation errors. We plot the error rate as a function of both size and $k$.
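The optimal size can also be pulled out programmatically rather than read off the printout (a sketch, assuming cv.carseats from the cell above):

# Size with the smallest cross-validated error: 55 errors at size 5
best.size <- cv.carseats$size[which.min(cv.carseats$dev)]
best.size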

In [224]:
par(mfrow = c(1,2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")

We now apply the prune.misclass() function in order to prune the tree and obtain the five-node tree.

In [225]:
prune.carseats <- prune.misclass(tree.carseats, best = 5)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

How well does this pruned tree perform on the test data set? Once again, we apply the predict() function.

In [226]:
tree.pred <- predict(prune.carseats, test, type = "class")
t.table <- table(tree.pred, test$High)
t.table
         
tree.pred  0  1
        0 64 18
        1 19 49
In [227]:
(64+49)/150
0.753333333333333
In [228]:
t.test.pred.summary<-summary.pred(t.table)
t.test.pred.summary
$FPR
0.22
$TPR
0.72
$PPR
0.73
$NPR
0.77

Now $75\%$ of the test observations are correctly classified, so not only has the pruning process produced a more interpretable tree, but it has also improved the classification accuracy.

Bagging

In [136]:
dim(train)
  1. 250
  2. 12
In [231]:
library(randomForest)
set.seed (1)
bag.model <- randomForest(High ~ ., data = train,
                          mtry = 11, importance = TRUE)
# Caution: High ~ . keeps Sales, the variable High was built from (see the note below)
bag.model
Call:
 randomForest(formula = High ~ ., data = train, mtry = 11, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 11

        OOB estimate of  error rate: 0.4%
Confusion matrix:
    0  1 class.error
0 152  1 0.006535948
1   0 97 0.000000000
In [232]:
bag.pred <- predict(bag.model, newdata = test)
In [233]:
b.table<-table(bag.pred ,test$High)
b.table
        
bag.pred  0  1
       0 83  0
       1  0 67
In [234]:
(83+67)/150
1
In [235]:
b.test.pred.summary<-summary.pred(b.table)
b.test.pred.summary
$FPR
0
$TPR
1
$PPR
1
$NPR
1
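The perfect scores are a red flag rather than a triumph: High was constructed directly from Sales, and the formula High ~ . still hands Sales to the model, so it can separate the classes exactly. A sketch of an honest bagging fit that drops Sales (bag.honest is a name introduced here; its results will differ from the output above):

set.seed(1)
bag.honest <- randomForest(High ~ . - Sales, data = train,
                           mtry = 10, importance = TRUE)  # 10 predictors remain without Sales
mean(predict(bag.honest, newdata = test) == test$High)    # honest test-set accuracy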

Random Forest

In [236]:
set.seed (1)
rf.model <- randomForest(High ~ ., data = train,
                         mtry = 4, importance = TRUE)  # same caveat: Sales is still a predictor
rf.model
pred.rf <- predict(rf.model, newdata = test)
Call:
 randomForest(formula = High ~ ., data = train, mtry = 4, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 0.4%
Confusion matrix:
    0  1 class.error
0 152  1 0.006535948
1   0 97 0.000000000
In [237]:
rf.table<-table(pred.rf ,test$High)
rf.table
       
pred.rf  0  1
      0 83  0
      1  0 67
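As with bagging, the perfect confusion matrix here reflects the Sales leakage rather than genuine predictive skill; refit with High ~ . - Sales for a meaningful benchmark. Either way, the randomForest package's importance helpers show which predictors drive the splits (assuming rf.model from above):

importance(rf.model)   # permutation and Gini importance for each predictor
varImpPlot(rf.model)   # dot chart of the same measures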