
44 — Supervised Learning in Multilayer Networks

Within the Bayesian framework for data modelling, it is easy to improve our probabilistic models. For example, if we believe that some input variables in a problem may be irrelevant to the predicted quantity, but we don't know which, we can define a new model with multiple hyperparameters that captures the idea of uncertain input variable relevance (MacKay, 1994b; Neal, 1996; MacKay, 1995b); these models then infer automatically from the data which are the relevant input variables for a problem.
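To make the idea concrete, here is a minimal sketch in the simpler setting of Bayesian linear regression, rather than the multilayer networks of this chapter. One precision hyperparameter per input weight, re-estimated by the evidence-framework update α_i ← γ_i/m_i², implements the automatic-relevance-determination scheme the citations describe; the toy data and all names here are illustrative assumptions, not code from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: five inputs, but only the first two are relevant to the target.
N, D = 200, 5
X = rng.normal(size=(N, D))
t = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=N)

alpha = np.ones(D)   # one prior precision per input weight (the hyperparameters)
beta = 100.0         # noise precision, assumed known here for simplicity

for _ in range(50):
    # Gaussian posterior over the weights under the current hyperparameters.
    Sigma = np.linalg.inv(beta * X.T @ X + np.diag(alpha))
    m = beta * Sigma @ X.T @ t
    # Evidence-framework re-estimate: gamma_i measures how well-determined
    # weight i is by the data. Irrelevant inputs are driven towards large
    # alpha_i, pinning their weights to zero.
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = gamma / (m ** 2 + 1e-12)

print(np.round(m, 3))   # weights for inputs 2-4 shrink towards zero
print(alpha)            # alphas for inputs 2-4 grow large: inputs pruned
```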

44.5 Exercises

Exercise 44.1. [4] How to measure a classifier's quality. You've just written a new classification algorithm and want to measure how well it performs on a test set, and compare it with other classifiers. What performance measure should you use? There are several standard answers. Let's assume the classifier gives an output y(x), where x is the input, which we won't discuss further, and that the true target value is t. In the simplest discussions of classifiers, both y and t are binary variables, but you might care to consider cases where y and t are more general objects also.

The most widely used measure of performance on a test set is the error rate – the fraction of misclassifications made by the classifier. This measure forces the classifier to give a 0/1 output and ignores any additional information that the classifier might be able to offer – for example, an indication of the firmness of a prediction. Unfortunately, the error rate does not necessarily measure how informative a classifier's output is. Consider frequency tables showing the joint frequency of the 0/1 output of a classifier (horizontal axis), and the true 0/1 variable (vertical axis). The numbers that we'll show are percentages. The error rate e is the sum of the two off-diagonal numbers, which we could call the false positive rate e+ and the false negative rate e−. Of the following three classifiers, A and B have the same error rate of 10% and C has a greater error rate of 12%.

Classifier A          Classifier B          Classifier C

       y                     y                     y
       0   1                 0   1                 0   1
 t 0  90   0           t 0  80  10           t 0  78  12
   1  10   0             1   0  10             1   0  10

But clearly classifier A, which simply guesses that the outcome is 0 for all cases, is conveying no information at all about t; whereas classifier B has an informative output: if y = 0 then we are sure that t really is zero; and if y = 1 then there is a 50% chance that t = 1, as compared to the prior probability P(t = 1) = 0.1. Classifier C is slightly less informative than B, but it is still much more useful than the information-free classifier A.

How common sense ranks the classifiers: (best) B > C > A (worst). How error rate ranks the classifiers: (best) A = B > C (worst).
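The text does not quantify 'informative' at this point; one natural yardstick consistent with the ranking above is the mutual information I(t; y) between target and output. The following Python sketch (my own, with the joint tables copied from above) computes the error rate, the error pair (e+, e−) introduced in the next paragraph, and I(t; y) for all three classifiers:

```python
import numpy as np

# Joint frequency tables p(t, y) for classifiers A, B and C, copied from
# above (rows: t = 0, 1; columns: y = 0, 1; entries as probabilities).
classifiers = {
    "A": np.array([[0.90, 0.00],
                   [0.10, 0.00]]),
    "B": np.array([[0.80, 0.10],
                   [0.00, 0.10]]),
    "C": np.array([[0.78, 0.12],
                   [0.00, 0.10]]),
}

def error_rates(p):
    """(e+, e-): the two off-diagonal entries of the joint table."""
    return p[0, 1], p[1, 0]

def mutual_information(p):
    """I(t; y) in bits, from the joint distribution p(t, y)."""
    pt = p.sum(axis=1, keepdims=True)  # marginal P(t)
    py = p.sum(axis=0, keepdims=True)  # marginal P(y)
    nz = p > 0                         # convention: 0 log 0 = 0
    return float((p[nz] * np.log2(p[nz] / (pt @ py)[nz])).sum())

for name, p in classifiers.items():
    ep, em = error_rates(p)
    print(f"{name}: e = {ep + em:.2f}  (e+, e-) = ({ep:.2f}, {em:.2f})  "
          f"I(t;y) = {mutual_information(p):.3f} bits")
# A: e = 0.10  (e+, e-) = (0.00, 0.10)  I(t;y) = 0.000 bits
# B: e = 0.10  (e+, e-) = (0.10, 0.00)  I(t;y) = 0.269 bits
# C: e = 0.12  (e+, e-) = (0.12, 0.00)  I(t;y) = 0.250 bits
```

The mutual information reproduces the common-sense ranking B > C > A, and it gives the worst possible score, zero bits, to the information-free classifier A.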

One way to improve on the error rate as a performance measure is to report the pair (e+, e−), the false positive error rate and the false negative error rate, which are (0, 0.1) and (0.1, 0) for classifiers A and B. It is especially important to distinguish between these two error probabilities in applications where the two sorts of error have different associated costs. However, there are a couple of problems with the 'error rate pair':

• First, if I simply told you that classifier A has error rates (0, 0.1) and B has error rates (0.1, 0), it would not be immediately evident that classifier A is actually utterly worthless. Surely we should have a performance measure that gives the worst possible score to A! (A quick check along these lines is sketched after this list.)

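To make that first problem concrete: A's pair (0, 0.1) is exactly what a classifier that ignores its input achieves, so a worthlessness check can be written directly. This tiny sketch (my own; the helper name is hypothetical) flags any (e+, e−) pair attainable by always guessing the same class:

```python
def matches_constant_classifier(e_pos, e_neg, p_t1):
    """True if (e+, e-) is achievable by a classifier that ignores its input:
    'always guess 0' gives (0, P(t=1)); 'always guess 1' gives (1 - P(t=1), 0)."""
    return (e_pos == 0.0 and e_neg == p_t1) or \
           (e_neg == 0.0 and e_pos == 1.0 - p_t1)

p_t1 = 0.1  # prior P(t = 1) in the test set above
print(matches_constant_classifier(0.0, 0.1, p_t1))  # A -> True: worthless
print(matches_constant_classifier(0.1, 0.0, p_t1))  # B -> False: informative
```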
