In :
rm(list=ls())


### Predicting the Sales of Carseats¶

We first use classification trees to analyze the Carseats data set. The main goal is to predict the sales of carseats and to find the important features that influence sales. In these data, Sales is a continuous variable, so we begin by recoding it as a binary variable: we use the ifelse() function to create a variable called High, which takes the value 1 if Sales exceeds 8 and 0 otherwise.

In :
library(ISLR)
library(plotly)


## Data¶

A data frame with 400 observations on the following 11 variables.

• Sales- Unit sales (in thousands) at each location
• CompPrice - Price charged by competitor at each location
• Income - Community income level (in thousands of dollars)
• Advertising - Local advertising budget for company at each location (in thousands of dollars)
• Population - Population size in region (in thousands)
• Price - Price company charges for car seats at each site
• ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
• Age -Average age of the local population
• Education -Education level at each location
• Urban -A factor with levels No and Yes to indicate whether the store is in an urban or rural location
• US -A factor with levels No and Yes to indicate whether the store is in the US or not
In :
head(Carseats)

Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
 9.50       138     73          11        276   120       Bad  42        17   Yes Yes
11.22       111     48          16        260    83      Good  65        10   Yes Yes
10.06       113     35          10        269    80    Medium  59        12   Yes Yes
 7.40       117    100           4        466    97    Medium  55        14   Yes Yes
 4.15       141     64           3        340   128       Bad  38        13   Yes  No
10.81       124    113          13        501    72       Bad  78        16    No Yes
In :
sum(is.na(Carseats))

0
In :
summary(Carseats)

     Sales          CompPrice       Income        Advertising
Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000
1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000
Median : 7.490   Median :125   Median : 69.00   Median : 5.000
Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635
3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000
Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000
Population        Price        ShelveLoc        Age          Education
Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0
1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0
Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0
Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9
3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0
Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0
Urban       US
No :118   No :142
Yes:282   Yes:258



In these data, Sales is a continuous variable, so we begin by recoding it as a binary variable. We use the ifelse() function to create a variable, called High, which takes the value 1 if Sales exceeds 8 and 0 otherwise.
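ifelse() is vectorized, so a single call maps the whole Sales column to 0/1. A tiny standalone illustration with made-up sales figures (not Carseats data):

```r
# ifelse() evaluates the condition element-wise and picks the
# matching value for each element
sales <- c(9.5, 7.4, 11.2)   # hypothetical sales values
ifelse(sales <= 8, 0, 1)     # returns 1 0 1
```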

In :
attach(Carseats)

In :
Carseats$High <- ifelse(Sales <= 8, 0, 1)

In :
# store High as a factor so that tree() and randomForest() fit
# classification (rather than regression) models
Carseats$High <- factor(Carseats$High)
Carseats$High

1. 1
2. 1
3. 1
4. 0
5. 0
6. 1
...
400. 1

Levels: 0 1

(400 values in total; output truncated)

Finally, we use the data.frame() function to merge High with the rest of the Carseats data.

In :
head(Carseats)

Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US High
 9.50       138     73          11        276   120       Bad  42        17   Yes Yes    1
11.22       111     48          16        260    83      Good  65        10   Yes Yes    1
10.06       113     35          10        269    80    Medium  59        12   Yes Yes    1
 7.40       117    100           4        466    97    Medium  55        14   Yes Yes    0
 4.15       141     64           3        340   128       Bad  38        13   Yes  No    0
10.81       124    113          13        501    72       Bad  78        16    No Yes    1
In :
In :
plot(Carseats$Price, Carseats$Advertising, col = c("red", "blue")[Carseats$High])

In :
plot(Advertising, ShelveLoc, col = c("red", "blue")[High])

In :
# plotly cannot compute an axis range for a factor: with z = ~ShelveLoc the
# call fails with "'range' not meaningful for factors". One workaround is to
# convert the factor to numeric codes for the z axis.
plot_ly(Carseats, x = ~Price, y = ~Advertising, z = ~as.numeric(ShelveLoc),
        color = ~High, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = ''),
                      yaxis = list(title = ''),
                      zaxis = list(title = '')))

### Logistic regression¶

In :
l <- glm(High ~ . - Sales, data = Carseats, family = binomial)

In :
summary(l)

Call:
glm(formula = High ~ . - Sales, family = binomial, data = Carseats)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.40315  -0.28543  -0.04319   0.18482   2.98849

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -5.0889857  2.4067745  -2.114   0.0345 *
CompPrice        0.1724750  0.0233456   7.388 1.49e-13 ***
Income           0.0341248  0.0079002   4.319 1.56e-05 ***
Advertising      0.2957492  0.0512506   5.771 7.90e-09 ***
Population      -0.0009173  0.0014190  -0.646   0.5180
Price           -0.1674933  0.0196266  -8.534  < 2e-16 ***
ShelveLocGood    8.2408876  1.0328293   7.979 1.48e-15 ***
ShelveLocMedium  3.5970462  0.6520385   5.517 3.46e-08 ***
Age             -0.0793793  0.0142195  -5.582 2.37e-08 ***
Education       -0.0576359  0.0763170  -0.755   0.4501
UrbanYes        -0.3827861  0.4392739  -0.871   0.3835
USYes           -0.6986095  0.5949922  -1.174   0.2403
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 541.49  on 399  degrees of freedom
Residual deviance: 181.48  on 388  degrees of freedom
AIC: 205.48

Number of Fisher Scoring iterations: 7

In :
exp(0.295)

1.34312635868628

In :
1-exp(-0.383)

0.318187100714009

The fitted model says that, holding the other predictors fixed, we expect a $34\%$ increase in the odds of high sales for a one-unit increase in Advertising, since $\exp(0.295) = 1.34$. Likewise, holding the rest fixed, the odds of high sales for stores in urban areas (Urban = Yes) relative to non-urban stores are $\exp(-0.383) = 0.68$; in terms of percent change, the odds of high sales for urban stores are about $32\%$ lower than for non-urban stores.
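The two calculations above are instances of a general rule: exponentiating a logistic-regression coefficient gives the multiplicative change in the odds for a one-unit increase in that predictor. A short sketch, with the coefficient values copied from the summary(l) output above:

```r
# coefficients taken from the summary(l) output above
beta_advertising <- 0.2957492
beta_urban       <- -0.3827861

exp(beta_advertising)   # about 1.34: odds rise ~34% per unit of Advertising
exp(beta_urban)         # about 0.68: urban stores' odds relative to non-urban
1 - exp(beta_urban)     # about 0.32: i.e. ~32% lower odds for urban stores
```

In general, exp(coef(l)) returns the full vector of odds ratios at once.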
#### Prediction¶

In :
set.seed(33)
test.index <- sample(1:nrow(Carseats), size = 150, replace = F)
test  <- Carseats[test.index, ]
train <- Carseats[-test.index, ]

In :
l <- glm(High ~ . - Sales, data = train, family = binomial)

#### In Sample Prediction¶

In :
l.pred  <- predict(l, train, type = "response")
l.pred1 <- as.numeric(l.pred > 0.5)

In :
l.table <- table(l.pred1, train$High)
l.table


l.pred1   0   1
0 139  15
1  14  82
In :
sum(l.pred1)

96
In :
(139+82)/250

0.884
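The overall accuracy computed above is just the sum of the diagonal of the confusion table divided by the total count; a sketch using the in-sample counts printed above:

```r
# in-sample confusion table from above (rows = predicted, cols = actual)
tab <- matrix(c(139, 14, 15, 82), nrow = 2,
              dimnames = list(l.pred1 = c("0", "1"), High = c("0", "1")))

sum(diag(tab)) / sum(tab)   # (139 + 82) / 250 = 0.884
```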
In :
summary.pred <- function(x)
{
  # x: confusion table with predicted classes in rows and true classes
  # in columns, e.g. table(predicted, actual)
  FPR <- x[1,2]/sum(x[1,])   # share of predicted-0 cases that are truly 1
  TPR <- x[2,2]/sum(x[2,])   # share of predicted-1 cases that are truly 1
  PPR <- x[2,2]/sum(x[,2])   # share of true-1 cases predicted as 1
  NPR <- x[1,1]/sum(x[,1])   # share of true-0 cases predicted as 0

  output <- list("FPR" = round(FPR,2), "TPR" = round(TPR,2),
                 "PPR" = round(PPR,2), "NPR" = round(NPR,2))
  return(output)
}
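To see what each rate in summary.pred() measures, here is the same arithmetic worked on a small made-up confusion table (illustrative counts only, not from the Carseats fits):

```r
# toy confusion table: rows = predicted class, columns = true class
x <- matrix(c(90, 5, 10, 45), nrow = 2,
            dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

x[1,2] / sum(x[1,])   # "FPR": 10/100 = 0.10
x[2,2] / sum(x[2,])   # "TPR": 45/50  = 0.90
x[2,2] / sum(x[,2])   # "PPR": 45/55  ~ 0.82
x[1,1] / sum(x[,1])   # "NPR": 90/95  ~ 0.95
```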

In :
l.pred.summary<-summary.pred(l.table)
l.pred.summary

$FPR
0.1
$TPR
0.85
$PPR
0.85
$NPR
0.91

#### Out of Sample Prediction¶

In :
l.pred <- predict(l, test, type = "response")
l.pred1 <- as.numeric(l.pred > 0.5)

In :
l.table.test <- table(l.pred1, test$High)
l.table.test


l.pred1  0  1
      0 80 11
      1  3 56

In :
(80+56)/150

0.906666666666667

In :
l.test.pred.summary <- summary.pred(l.table.test)
l.test.pred.summary

$FPR
0.12
$TPR
0.95
$PPR
0.84
$NPR
0.96

## Tree Based Models¶

We now use the tree() function to fit a classification tree in order to predict High using all variables but Sales. The syntax of the tree() function is quite similar to that of the lm() function.

In :
library(tree)

In :
tree.carseats <- tree(High ~ . - Sales, Carseats)

In :
summary(tree.carseats)

Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
 "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"
 "Advertising" "Age"         "US"
Number of terminal nodes:  27
Residual mean deviance:  0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400

We see that the training error rate is 9%. For classification trees, the deviance reported in the output of summary() is given by

$$-2 \sum_m \sum_k n_{mk}\log \hat p_{mk}$$

where $n_{mk}$ is the number of observations in the $m$th terminal node that belong to the $k$th class. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by $n-|T_0|$, which in this case is $400-27 = 373$.

One of the most attractive properties of trees is that they can be graphically displayed. We use the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty=0 instructs R to include the category names for any qualitative predictors, rather than simply displaying a letter for each category.

In :
plot(tree.carseats)
text(tree.carseats, pretty = 0)

The most important indicator of Sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations. If we just type the name of the tree object, R prints output corresponding to each branch of the tree.
R displays the split criterion (e.g. $Price < 92.5$), the number of observations in that branch, the deviance, the overall prediction for the branch (0 or 1), and the fraction of observations in that branch that take on each value. Branches that lead to terminal nodes are indicated using asterisks.

In :
tree.carseats

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 400 541.500 0 ( 0.59000 0.41000 )
    2) ShelveLoc: Bad,Medium 315 390.600 0 ( 0.68889 0.31111 )
      4) Price < 92.5 46 56.530 1 ( 0.30435 0.69565 )
        8) Income < 57 10 12.220 0 ( 0.70000 0.30000 )
         16) CompPrice < 110.5 5 0.000 0 ( 1.00000 0.00000 ) *
         17) CompPrice > 110.5 5 6.730 1 ( 0.40000 0.60000 ) *
        9) Income > 57 36 35.470 1 ( 0.19444 0.80556 )
         18) Population < 207.5 16 21.170 1 ( 0.37500 0.62500 ) *
         19) Population > 207.5 20 7.941 1 ( 0.05000 0.95000 ) *
      5) Price > 92.5 269 299.800 0 ( 0.75465 0.24535 )
       10) Advertising < 13.5 224 213.200 0 ( 0.81696 0.18304 )
         20) CompPrice < 124.5 96 44.890 0 ( 0.93750 0.06250 )
           40) Price < 106.5 38 33.150 0 ( 0.84211 0.15789 )
             80) Population < 177 12 16.300 0 ( 0.58333 0.41667 )
              160) Income < 60.5 6 0.000 0 ( 1.00000 0.00000 ) *
              161) Income > 60.5 6 5.407 1 ( 0.16667 0.83333 ) *
             81) Population > 177 26 8.477 0 ( 0.96154 0.03846 ) *
           41) Price > 106.5 58 0.000 0 ( 1.00000 0.00000 ) *
         21) CompPrice > 124.5 128 150.200 0 ( 0.72656 0.27344 )
           42) Price < 122.5 51 70.680 1 ( 0.49020 0.50980 )
             84) ShelveLoc: Bad 11 6.702 0 ( 0.90909 0.09091 ) *
             85) ShelveLoc: Medium 40 52.930 1 ( 0.37500 0.62500 )
              170) Price < 109.5 16 7.481 1 ( 0.06250 0.93750 ) *
              171) Price > 109.5 24 32.600 0 ( 0.58333 0.41667 )
                342) Age < 49.5 13 16.050 1 ( 0.30769 0.69231 ) *
                343) Age > 49.5 11 6.702 0 ( 0.90909 0.09091 ) *
           43) Price > 122.5 77 55.540 0 ( 0.88312 0.11688 )
             86) CompPrice < 147.5 58 17.400 0 ( 0.96552 0.03448 ) *
             87) CompPrice > 147.5 19 25.010 0 ( 0.63158 0.36842 )
              174) Price < 147 12 16.300 1 ( 0.41667 0.58333 )
                348) CompPrice < 152.5 7 5.742 1 ( 0.14286 0.85714 ) *
                349) CompPrice > 152.5 5 5.004 0 ( 0.80000 0.20000 ) *
              175) Price > 147 7 0.000 0 ( 1.00000 0.00000 ) *
       11) Advertising > 13.5 45 61.830 1 ( 0.44444 0.55556 )
         22) Age < 54.5 25 25.020 1 ( 0.20000 0.80000 )
           44) CompPrice < 130.5 14 18.250 1 ( 0.35714 0.64286 )
             88) Income < 100 9 12.370 0 ( 0.55556 0.44444 ) *
             89) Income > 100 5 0.000 1 ( 0.00000 1.00000 ) *
           45) CompPrice > 130.5 11 0.000 1 ( 0.00000 1.00000 ) *
         23) Age > 54.5 20 22.490 0 ( 0.75000 0.25000 )
           46) CompPrice < 122.5 10 0.000 0 ( 1.00000 0.00000 ) *
           47) CompPrice > 122.5 10 13.860 0 ( 0.50000 0.50000 )
             94) Price < 125 5 0.000 1 ( 0.00000 1.00000 ) *
             95) Price > 125 5 0.000 0 ( 1.00000 0.00000 ) *
    3) ShelveLoc: Good 85 90.330 1 ( 0.22353 0.77647 )
      6) Price < 135 68 49.260 1 ( 0.11765 0.88235 )
       12) US: No 17 22.070 1 ( 0.35294 0.64706 )
         24) Price < 109 8 0.000 1 ( 0.00000 1.00000 ) *
         25) Price > 109 9 11.460 0 ( 0.66667 0.33333 ) *
       13) US: Yes 51 16.880 1 ( 0.03922 0.96078 ) *
      7) Price > 135 17 22.070 0 ( 0.64706 0.35294 )
       14) Income < 46 6 0.000 0 ( 1.00000 0.00000 ) *
       15) Income > 46 11 15.160 1 ( 0.45455 0.54545 ) *

In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. The predict() function can be used for this purpose. In the case of a classification tree, the argument type="class" instructs R to return the actual class prediction. This approach leads to correct predictions for around $74\%$ of the locations in the test data set.

In :
tree.carseats <- tree(High ~ . - Sales, train)
tree.pred <- predict(tree.carseats, test, type = "class")
t.table <- table(tree.pred, test$High)
t.table


tree.pred  0  1
0 64 20
1 19 47
In :
(64+47)/150

0.74
In :
t.test.pred.summary<-summary.pred(t.table)
t.test.pred.summary

$FPR
0.24
$TPR
0.71
$PPR
0.7
$NPR
0.77

Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration. We use the argument FUN=prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance. The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to the $\alpha$ in the lecture notes).
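In symbols, cost-complexity pruning selects, for each value of the tuning parameter, the subtree $T \subset T_0$ minimizing a penalized training error. With misclassification error as the loss (standard notation; this $\alpha$ is what cv.tree() reports as k):

$$C_\alpha(T) \;=\; \sum_{m=1}^{|T|} \sum_{x_i \in R_m} \mathbb{1}\left(y_i \neq \hat{y}_{R_m}\right) \;+\; \alpha\,|T|$$

where $|T|$ is the number of terminal nodes of subtree $T$, $R_m$ is the region corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the majority class in that node. Larger values of $\alpha$ penalize larger trees.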

In :
set.seed(3)
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
names(cv.carseats)
cv.carseats

1. 'size'
2. 'dev'
3. 'k'
4. 'method'
$size
 21 19 13  9  5  2  1

$dev
 71 71 66 58 55 71 97

$k
     -Inf  0.000000  1.000000  1.500000  2.500000  8.333333 26.000000

$method
 "misclass"

attr(,"class")
 "prune"         "tree.sequence"

Note that, despite the name, dev corresponds to the cross-validation error rate in this instance. The tree with $5$ terminal nodes results in the lowest cross-validation error rate, with $55$ cross-validation errors. We plot the error rate as a function of both size and $k$.
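The best size can also be extracted programmatically rather than read off the printout; a small sketch using the size and dev vectors exactly as printed above:

```r
# cv.tree() results, copied from the printout above
size <- c(21, 19, 13, 9, 5, 2, 1)
dev  <- c(71, 71, 66, 58, 55, 71, 97)

best.size <- size[which.min(dev)]
best.size   # 5: five terminal nodes give the minimum CV error
min(dev)    # 55
```

In practice one would use cv.carseats$size[which.min(cv.carseats$dev)] directly on the fitted object.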

In :
par(mfrow =c(1,2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")

We now apply the prune.misclass() function in order to prune the tree and obtain the five-node tree.

In :
prune.carseats <- prune.misclass(tree.carseats, best = 5)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

How well does this pruned tree perform on the test data set? Once again, we apply the predict() function.


In :
tree.pred <- predict(prune.carseats, test, type = "class")
t.table <- table(tree.pred, test$High)
t.table


tree.pred  0  1
        0 64 18
        1 19 49

In :
(64+49)/150

0.753333333333333

In :
t.test.pred.summary <- summary.pred(t.table)
t.test.pred.summary

$FPR
0.22
$TPR
0.72
$PPR
0.73
$NPR
0.77

Now $75\%$ of the test observations are correctly classified, so not only has the pruning process produced a more interpretable tree, but it has also improved the classification accuracy.

### Bagging¶

In :
dim(train)

1. 250
2. 12

In :
library(randomForest)
set.seed(1)
bag.model <- randomForest(High ~ ., data = train, mtry = 11, importance = TRUE)
bag.model

Call:
 randomForest(formula = High ~ ., data = train, mtry = 11, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 11

        OOB estimate of  error rate: 0.4%
Confusion matrix:
    0  1 class.error
0 152  1 0.006535948
1   0 97 0.000000000

Note that the formula High ~ . leaves Sales among the predictors. Since High was constructed directly from Sales, this leaks the response into the model, which explains the near-perfect error rates here and on the test set below; using High ~ . - Sales, as in the earlier fits, would give an honest assessment.

In :
bag.pred <- predict(bag.model, newdata = test)

In :
b.table <- table(bag.pred, test$High)
b.table


bag.pred  0  1
0 83  0
1  0 67
In :
(83+67)/150

1
In :
b.test.pred.summary<-summary.pred(b.table)
b.test.pred.summary

$FPR
0
$TPR
1
$PPR
1
$NPR
1

#### Random Forest¶

In :
set.seed(1)
rf.model <- randomForest(High ~ ., data = train, mtry = 4, importance = TRUE)
rf.model
pred.rf <- predict(rf.model, newdata = test)

Call:
randomForest(formula = High ~ ., data = train, mtry = 4, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4

OOB estimate of  error rate: 0.4%
Confusion matrix:
    0  1 class.error
0 152  1 0.006535948
1   0 97 0.000000000
In :
rf.table <- table(pred.rf, test$High)
rf.table


pred.rf  0  1
0 83  0
1  0 67