Learning Statistics with R - A tutorial for psychology students and other beginners, 2018a

More documents

Recommendations

Info

• Orientation of the axis labels las. I presume that the name of this parameter is an acronym of label style or something along those lines; but what it actually does is govern the orientation of the text used to label the individual tick marks (i.e., the numbering, not the xlab and ylab axis labels). There are four possible values for las: A value of 0 means that the labels of both axes are printed parallel to the axis itself (the default). A value of 1 means that the text is always horizontal. A value of 2 means that the labelling text is printed at right angles to the axis. Finally, a value of 3 means that the text is always vertical. Again, these aren’t the only possibilities. There are a few other graphical parameters that I haven’t mentioned that you could use to customise the appearance of the axes, 7 but that’s probably enough (or more than enough) for now. To give a sense of how you could use these parameters, let’s try the following command: > plot( x = Fibonacci, # the data + xaxt = "n", # don’t draw the x-axis + bty = "]", # keep bottom, right and top of box only + las = 1 # rotate the text + ) The output is shown in Figure 6.9. As you can see, this isn’t a very useful plot at all. However, it does illustrate the graphical parameters we’re talking about, so I suppose it serves its purpose. 6.2.6 Don’t panic At this point, a lot of readers will be probably be thinking something along the lines of, “if there’s this much detail just for drawing a simple plot, how horrible is it going to get when we start looking at more complicated things?” Perhaps, contrary to my earlier pleas for mercy, you’ve found a brick to hurl and are right now leafing through an Adelaide phone book trying to find my address. Well, fear not! And please, put the brick down. In a lot of ways, we’ve gone through the hardest part: we’ve already covered vast majority of the plot customisations that you might want to do. As you’ll see, each of the other high level plotting commands we’ll talk about will only have a smallish number of additional options. Better yet, even though I’ve told you about a billion different ways of tweaking your plot, you don’t usually need them. So in practice, now that you’ve read over it once to get the gist, the majority of the content of this section is stuff you can safely forget: just remember to come back to this section later on when you want to tweak your plot. 6.3 Histograms Now that we’ve tamed (or possibly fled from) the beast that is R graphical parameters, let’s talk more seriously about some real life graphics that you’ll want to draw. We begin with the humble histogram. Histograms are one of the simplest and most useful ways of visualising data. They make most sense when you have an interval or ratio scale (e.g., the afl.margins data from Chapter 5) and what you want to do is get an overall impression of the data. Most of you probably know how histograms work, since they’re so widely used, but for the sake of completeness I’ll describe them. All you do is divide up the possible values into bins, and then count the number of observations that fall within each bin. This count is referred to as the frequency of the bin, and is displayed as a bar: in the AFL winning margins data, there are 33 games in which the winning margin was less than 10 points, and it is this fact that is represented 7 Also, there’s a low level function called axis() that allows a lot more control over the appearance of the axes. - 170 -
y the height of the leftmost bar in Figure 6.10a. Drawing this histogram in R is pretty straightforward. The function you need to use is called hist(), and it has pretty reasonable default settings. In fact, Figure 6.10a is exactly what you get if you just type this: > hist( afl.margins ) # panel a Although this image would need a lot of cleaning up in order to make a good presentation graphic (i.e., one you’d include in a report), it nevertheless does a pretty good job of describing the data. In fact, the big strength of a histogram is that (properly used) it does show the entire spread of the data, so you can get a pretty good sense about what it looks like. The downside to histograms is that they aren’t very compact: unlike some of the other plots I’ll talk about it’s hard to cram 20-30 histograms into a single image without overwhelming the viewer. And of course, if your data are nominal scale (e.g., the afl.finalists data) then histograms are useless. The main subtlety that you need to be aware of when drawing histograms is determining where the breaks that separate bins should be located, and (relatedly) how many breaks there should be. In Figure 6.10a, you can see that R has made pretty sensible choices all by itself: the breaks are located at 0, 10, 20, . . . 120, which is exactly what I would have done had I been forced to make a choice myself. On the other hand, consider the two histograms in Figure 6.10b and6.10c, which I produced using the following two commands: > hist( x = afl.margins, breaks = 3 ) # panel b > hist( x = afl.margins, breaks = 0:116 ) # panel c In Figure 6.10c, the bins are only 1 point wide. As a result, although the plot is very informative (it displays the entire data set with no loss of information at all!) the plot is very hard to interpret, and feels quite cluttered. On the other hand, the plot in Figure 6.10b has a bin width of 50 points, and has the opposite problem: it’s very easy to “read” this plot, but it doesn’t convey a lot of information. One gets the sense that this histogram is hiding too much. In short, the way in which you specify the breaks has a big effect on what the histogram looks like, so it’s important to make sure you choose the breaks sensibly. In general R does a pretty good job of selecting the breaks on its own, since it makes use of some quite clever tricks that statisticians have devised for automatically selecting the right bins for a histogram, but nevertheless it’s usually a good idea to play around with the breaks a bit to see what happens. There is one fairly important thing to add regarding how the breaks argument works. There are two different ways you can specify the breaks. You can either specify how many breaks you want (which is what I did for panel b when I typed breaks = 3) andletR figure out where they should go, or you can provide a vector that tells R exactly where the breaks should be placed (which is what I did for panel c when I typed breaks = 0:116). The behaviour of the hist() function is slightly different depending on which version you use. If all you do is tell it how many breaks you want, R treats it as a “suggestion” not as a demand. It assumes you want “approximately 3” breaks, but if it doesn’t think that this would look very pretty on screen, it picks a different (but similar) number. It does this for a sensible reason – it tries to make sure that the breaks are located at sensible values (like 10) rather than stupid ones (like 7.224414). And most of the time R is right: usually, when a human researcher says “give me 3 breaks”, he or she really does mean “give me approximately 3 breaks, and don’t put them in stupid places”. However, sometimes R is dead wrong. Sometimes you really do mean “exactly 3 breaks”, and you know precisely where you want them to go. So you need to invoke “real person privilege”, and order R to do what it’s bloody well told. In order to do that, you have to input the full vector that tells R exactly where you want the breaks. If you do that, R will go back to behaving like the nice little obedient calculator that it’s supposed to be. - 171 -
Page 2 and 3:
Learning statistics with R: A tutor
Page 4 and 5:
This book is published under a Crea
Page 6 and 7:
Preface Table of Contents I Backgro
Page 8 and 9:
8.6 Summary . . . . . . . . . . . .
Page 10 and 11:
VI Endings, alternatives and prospe
Page 12 and 13:
February 16, 2015 Preface to Versio
Page 14 and 15:
and density is discussed. A detaile
Page 17 and 18:
1. Why do we learn statistics? “T
Page 19 and 20:
And finally, an invalid argument wi
Page 21 and 22:
Admission rate (both genders) 0 20
Page 23 and 24:
experiment, and they don’t get an
Page 25 and 26:
2. A brief introduction to research
Page 27 and 28:
different ways: the length of time
Page 29 and 30:
that when I ask 100 people how they
Page 31 and 32:
Table 2.1: The relationship between
Page 33 and 34:
2.3 Assessing the reliability of a
Page 35 and 36:
2.5.1 Experimental research The key
Page 37 and 38:
things “inside” the study. Let
Page 39 and 40:
they lack face validity, you’ll f
Page 41 and 42:
that age is growing at an incredibl
Page 43 and 44:
then it can be a big problem. 2.7.7
Page 45 and 46:
2.7.10 Placebo effects The placebo
Page 47 and 48:
it doesn’t, which one do you thin
Page 49:
Part II. An introduction to R - 35
Page 52 and 53:
go out into the real world. It’s
Page 54 and 55:
download Rstudio. To understand why
Page 56 and 57:
Most of this text is pretty uninter
Page 58 and 59:
citation() To cite R in publication
Page 60 and 61:
Table 3.1: Basic arithmetic operati
Page 62 and 63:
Division / and Multiplication *, th
Page 64 and 65:
sales * royalty [1] 2450 As far as
Page 66 and 67:
x to the power of 0.5”. Personall
Page 68 and 69:
As you can see, specifying the argu
Page 70 and 71:
Figure 3.3: If you’ve typed the n
Page 72 and 73:
sales.by.month sales.by.month [1]
Page 74 and 75:
profit / days.per.month [1] 0.00000
Page 76 and 77:
not true of R. R is not infinitely
Page 78 and 79:
Table 3.3: Some more logical operat
Page 80 and 81:
In other words, any.sales.this.mont
Page 82 and 83:
and gotten exactly the same result.
Page 84 and 85:
Figure 3.6: The options window in R
Page 86 and 87:
- 72 -
Page 88 and 89:
print( keeper ) [1] 8.539539 So, fr
Page 90 and 91:
to import an SPSS data file into R,
Page 92 and 93:
The output here is telling you that
Page 94 and 95:
Figure 4.3: The Rstudio dialog box
Page 96 and 97:
Figure 4.5: The Rstudio “Environm
Page 98 and 99:
this should be obvious already. But
Page 100 and 101:
Figure 4.6: The “file panel” is
Page 102 and 103:
from this file into my workspace. T
Page 104 and 105:
2 February 28 100 high 3 March 31 2
Page 106 and 107:
Figure 4.9: The Rstudio window for
Page 108 and 109:
4.6.1 Special values The first thin
Page 110 and 111:
x y x * y Error in x * y : non-nu
Page 112 and 113:
4.7.1 Introducing factors Suppose,
Page 114 and 115:
factors in 7.11.2, but for now you
Page 116 and 117:
An alternative method is to use the
Page 118 and 119:
4.11 Generic functions There’s on
Page 120 and 121:
load (base) R Documentation The R D
Page 122 and 123:
Warning Saved R objects are binary
Page 124 and 125:
- 110 -
Page 127 and 128:
5. Descriptive statistics Any time
Page 129 and 130:
mean of these observations is just:
Page 131 and 132:
mean( afl.margins[1:5] ) [1] 36.6 A
Page 133 and 134: the median rises only to $62,500. I
Page 135 and 136: Next, let’s calculate means and m
Page 137 and 138: 5.2 Measures of variability The sta
Page 139 and 140: English: which game value deviation
Page 141 and 142: and you get the same... no, wait...
Page 143 and 144: Mean − SD Mean Mean + SD Figure 5
Page 145 and 146: Negative Skew No Skew Positive Skew
Page 147 and 148: Platykurtic ("too flat") Mesokurtic
Page 149 and 150: e useful, let’s try this for a ne
Page 151 and 152: argument specifies the grouping var
Page 153 and 154: grumpiness score. 13 If the mean is
Page 155 and 156: Frequency 0 5 10 15 20 Frequency 0
Page 157 and 158: My grumpiness 40 50 60 70 80 90 4 6
Page 159 and 160: following extract illustrates the b
Page 161 and 162: Y1 4 5 6 7 8 9 10 Y2 3 4 5 6 7 8 9
Page 163 and 164: Okay, so let’s have a look at our
Page 165 and 166: pay 0.760 0.720 . 0.137 . 0.196 . d
Page 167 and 168: 5 6.68 9.75 NA 5 6 5.99 5.04 72 6 B
Page 169 and 170: • Measures of variability. In con
Page 171 and 172: 6. Drawing graphs Above all else sh
Page 173 and 174: delete the plot and then type a new
Page 175 and 176: Fibonacci 2 4 6 8 10 12 1 2 3 4 5 6
Page 177 and 178: high-level functions all rely on th
Page 179 and 180: type = ’p’ type = ’o’ type
Page 181 and 182: Fibonacci 2 4 6 8 10 12 1 2 3 4 5 6
Page 183: Figure 6.8: Altering the scale and
Page 187 and 188: 6.3.1 Visual style of your histogra
Page 189 and 190: 0 20 40 60 80 100 120 (a) (b) Figur
Page 191 and 192: 0 20 40 60 80 100 120 Winning Margi
Page 193 and 194: 0 50 100 150 200 250 300 Winning Ma
Page 195 and 196: 6.5.3 Drawing multiple boxplots One
Page 197 and 198: Winning Margin 0 50 100 150 1987 19
Page 199 and 200: The second way do to it is to use a
Page 201 and 202: cor( x = parenthood ) # calculate c
Page 203 and 204: 0 10 20 30 0 10 20 30 Adelaide Fitz
Page 205 and 206: value for mar is c(5.1, 4.1, 4.1, 2
Page 207 and 208: This takes the “active” figure
Page 209 and 210: 7. Pragmatic matters The garden of
Page 211 and 212: in the command above I didn’t nam
Page 213 and 214: speaker ee onk oo pip makka-pakka 0
Page 215 and 216: 2 7 3 3 1 3 3 -1 1 -1 BLAH BLAH BLA
Page 217 and 218: seen it before. In any case, those
Page 219 and 220: Table 7.2: Two more arithmetic oper
Page 221 and 222: fact R does provide a function for
Page 223 and 224: But suppose, on the other hand, tha
Page 225 and 226: is select several different variabl
Page 227 and 228: column number. So, if we want to pi
Page 229 and 230: case.2 upsy-daisy pip 2 case.4 makk
Page 231 and 232: Note that R will only allow you to
Page 233 and 234: 7.6.2 Sorting a factor You can also
Page 235 and 236:
Two other methods that I want to br
Page 237 and 238:
At this point you should have two q
Page 239 and 240:
(alcohol, caffeine or no drug). Bec
Page 241 and 242:
3 3 female 4.6 643 7.4 226 7.3 412
Page 243 and 244:
discuss melt() and cast() in a fair
Page 245 and 246:
paste( hw, ng, sep = ".", collapse
Page 247 and 248:
Table 7.3: The ordering of various
Page 249 and 250:
Table 7.4: Standard escape characte
Page 251 and 252:
sub( pattern = "a", replacement = "
Page 253 and 254:
Figure 7.1: The booksales2.csv data
Page 255 and 256:
note that if you don’t have the R
Page 257 and 258:
etween the two. However, there are
Page 259 and 260:
a third variable to the cross-tabs
Page 261 and 262:
today print(today) # display the d
Page 263 and 264:
encode numbers in binary, 29 0.1 is
Page 265 and 266:
environment. And so when you type i
Page 267 and 268:
8. Basic programming Machine dreams
Page 269 and 270:
for ages, you figure out some reall
Page 271 and 272:
Figure 8.2: A screenshot showing th
Page 273 and 274:
8.1.5 Differences between scripts a
Page 275 and 276:
to the second value in vector; and
Page 277 and 278:
crunching, we tell R (on line 21) t
Page 279 and 280:
What this does is create a function
Page 281 and 282:
hundreds of other things besides. H
Page 283:
Part IV. Statistical theory - 269 -
Page 286 and 287:
me: I guess I don’t see that. Sur
Page 288 and 289:
- 274 -
Page 290 and 291:
discussion of probability theory is
Page 292 and 293:
In this case 11 of these 20 coin fl
Page 294 and 295:
frequentists do this sometimes. And
Page 296 and 297:
Probability of event 0.0 0.1 0.2 0.
Page 298 and 299:
proceed to roll all 20 dice, what
Page 300 and 301:
(a) Probability 0.00 0.05 0.10 0.15
Page 302 and 303:
In other words, there is a 76.9% ch
Page 304 and 305:
Probability Density 0.0 0.1 0.2 0.3
Page 306 and 307:
Shaded Area = 68.3% Shaded Area = 9
Page 308 and 309:
Probability Density 0.0 0.1 0.2 0.3
Page 310 and 311:
• The χ 2 distribution is anothe
Page 312 and 313:
work rather than rchisq() - the obs
Page 314 and 315:
- 300 -
Page 316 and 317:
10.1.1 Defining a population A samp
Page 318 and 319:
Figure 10.2: Biased sampling withou
Page 320 and 321:
a stratified sampling technique you
Page 322 and 323:
10.2 The law of large numbers In th
Page 324 and 325:
Table 10.1: Ten replications of the
Page 326 and 327:
Sample Size = 1 Sample Size = 2 Sam
Page 328 and 329:
Sample Size = 1 Sample Size = 2 0.0
Page 330 and 331:
efer to them. For instance, if true
Page 332 and 333:
Average Sample Mean 96 98 100 102 1
Page 334 and 335:
10.5 Estimating a confidence interv
Page 336 and 337:
sense idea of what it means to say
Page 338 and 339:
Average Attendance 0 10000 30000 19
Page 340 and 341:
Here’s how to plot the means and
Page 342 and 343:
Let’s suppose that this glorious
Page 344 and 345:
Dan’s research hypothesis: “ESP
Page 346 and 347:
the type I error rate. As Blackston
Page 348 and 349:
Critical Regions for a Two−Sided
Page 350 and 351:
Critical Region for a One−Sided T
Page 352 and 353:
one based on Neyman’s approach to
Page 354 and 355:
alternative hypothesis is a near ce
Page 356 and 357:
Sampling Distribution for X if θ=.
Page 358 and 359:
Table 11.2: A crude guide to unders
Page 360 and 361:
ureaucratic process. It’s not par
Page 362 and 363:
(binomial test was non significant)
Page 365 and 366:
12. Categorical data analysis Now t
Page 367 and 368:
observations that fall within the i
Page 369 and 370:
it is when it predicts too many (wh
Page 371 and 372:
df = 3 df = 4 df = 5 0 2 4 6 8 10 V
Page 373 and 374:
That’s why I told R to use the up
Page 375 and 376:
clubs 35 40 0.2 diamonds 51 60 0.3
Page 377 and 378:
only one of many (albeit one of the
Page 379 and 380:
That’s more or less what we’re
Page 381 and 382:
Well, since these probabilities hav
Page 383 and 384:
Chapek 9 has an unfortunate tendenc
Page 385 and 386:
12.5 Assumptions of the test(s) All
Page 387 and 388:
To get the test of independence, al
Page 389 and 390:
This is a bit more output than we g
Page 391 and 392:
subj.1 : 1 no :70 no :90 subj.10 :
Page 393 and 394:
13. Comparing two means In the prev
Page 395 and 396:
40 50 60 70 80 90 Grades Figure 13.
Page 397 and 398:
Two Sided Test One Sided Test −1.
Page 399 and 400:
lower.area print( lower.area ) [1]
Page 401 and 402:
df = 2 df = 10 −4 −2 0 2 4 valu
Page 403 and 404:
than I’ve done here. For instance
Page 405 and 406:
Anastasia’s students Bernadette
Page 407 and 408:
null hypothesis μ alternative hypo
Page 409 and 410:
ut once again I’m going to start
Page 411 and 412:
for the difference “(mean 1) minu
Page 413 and 414:
null hypothesis μ alternative hypo
Page 415 and 416:
13.5.1 The data The data set that w
Page 417 and 418:
ciMean( x = chico$improvement ) 2.5
Page 419 and 420:
Paired samples t-test Variables: gr
Page 421 and 422:
Grouping variable: ID variable: tim
Page 423 and 424:
In a one-sided confidence interval,
Page 425 and 426:
are different tests. Secondly, I wa
Page 427 and 428:
Table 13.1: A (very) rough guide to
Page 429 and 430:
ut you no longer believe that the c
Page 431 and 432:
Frequency 0 5 10 15 20 Normally Dis
Page 433 and 434:
Sampling distribution of W (for nor
Page 435 and 436:
We then count up the number of chec
Page 437 and 438:
• A paired samples t-test is used
Page 439 and 440:
14. Comparing several means (one-wa
Page 441 and 442:
Mood Gain 0.5 1.0 1.5 placebo anxif
Page 443 and 444:
This formula looks pretty much iden
Page 445 and 446:
is true, then you’d expect all th
Page 447 and 448:
mean square, MS w , can be viewed a
Page 449 and 450:
3 placebo 0.1 0.45 -0.350 0.1225 4
Page 451 and 452:
is choose an α level (i.e., accept
Page 453 and 454:
names( my.anova ) [1] "coefficients
Page 455 and 456:
etaSquared( x = my.anova ) eta.sq e
Page 457 and 458:
One thing that bugs me slightly abo
Page 459 and 460:
If we compare these three p-values
Page 461 and 462:
• Homogeneity of variance. Notice
Page 463 and 464:
Secondly, I did mention that it’s
Page 465 and 466:
Frequency 0 1 2 3 4 5 6 7 Sample Qu
Page 467 and 468:
Looking at this table, notice that
Page 469 and 470:
• Post hoc analysis and correctio
Page 471 and 472:
15. Linear regression The goal in t
Page 473 and 474:
The Best Fitting Regression Line No
Page 475 and 476:
• data. The data frame containing
Page 477 and 478:
Figure 15.4: A 3D visualisation of
Page 479 and 480:
Wonderful. A big number that doesn
Page 481 and 482:
If our regression model has K predi
Page 483 and 484:
You can see why this is handy, sinc
Page 485 and 486:
what it is. The test for the signif
Page 487 and 488:
Fortunately, confidence intervals f
Page 489 and 490:
• Uncorrelated predictors. The id
Page 491 and 492:
Outlier Outcome Predictor Figure 15
Page 493 and 494:
Outcome High influence Predictor Fi
Page 495 and 496:
Cook’s distance 0.00 0.02 0.04 0.
Page 497 and 498:
Standardized residuals −2 −1 0
Page 499 and 500:
78 Residuals vs Fitted Residuals
Page 501 and 502:
In a lot of cases, the solution to
Page 503 and 504:
Signif. codes: 0 *** 0.001 ** 0.01
Page 505 and 506:
15.10.1 Backward elimination Okay,
Page 507 and 508:
+ day 1 1.55760 1837.2 297.08 + bab
Page 509 and 510:
etween those two SS values as a sum
Page 511 and 512:
16. Factorial ANOVA Over the course
Page 513 and 514:
At this point we have 12 different
Page 515 and 516:
drug 2 3.45 1.727 18.6 8.6e-05 ***
Page 517 and 518:
means can be organised into the sam
Page 519 and 520:
attributable to the “ANOVA model
Page 521 and 522:
Only Factor A has an effect Only Fa
Page 523 and 524:
The effect of CBT (difference betwe
Page 525 and 526:
Now, if you go back to the formulas
Page 527 and 528:
Here’s what I mean. Again, let’
Page 529 and 530:
Speaking of interaction effects, he
Page 531 and 532:
16.4 Assumption checking As with on
Page 533 and 534:
16.5.1 The F test comparing two mod
Page 535 and 536:
the degrees of freedom here is 15.
Page 537 and 538:
tfm.2 grade attend reading 1 90 yes
Page 539 and 540:
line, we obtain the following: Ŷ 7
Page 541 and 542:
Estimate Std. Error t value Pr(>|t|
Page 543 and 544:
that contains the same data as clin
Page 545 and 546:
What about the case when there does
Page 547 and 548:
R comes with a variety of functions
Page 549 and 550:
enlightening read. There are a lot
Page 551 and 552:
16.8 Post hoc tests Time to switch
Page 553 and 554:
diff lwr upr p adj anxifree:no.ther
Page 555 and 556:
1 yes none 3.700 2 no none 5.550 3
Page 557 and 558:
Analysis of Variance Table Response
Page 559 and 560:
III tests (which are simple) before
Page 561 and 562:
sugar 2.13 2 4.0446 0.045426 * milk
Page 563 and 564:
even simpler. The main effect of su
Page 565 and 566:
However, when we do the same thing
Page 567:
Part VI. Endings, alternatives and
Page 570 and 571:
first I want to work through a simp
Page 572 and 573:
This is a very useful table, so it
Page 574 and 575:
P pd|hqP phq P ph|dq “ P pdq And
Page 576 and 577:
Bayesian approach relative to the o
Page 578 and 579:
hardly unusual: in my experience, m
Page 580 and 581:
Cumulative Probability of Type I Er
Page 582 and 583:
are wrong. Worse yet, because we do
Page 584 and 585:
library( BayesFactor ) # ...because
Page 586 and 587:
make sense to people who already un
Page 588 and 589:
-------------- [1] Non-indep. (a=1)
Page 590 and 591:
Before moving on, it’s worth high
Page 592 and 593:
17.6.2 The Bayesian version Okay, s
Page 594 and 595:
[2] day : 0.2724027 ˘0% [3] baby.s
Page 596 and 597:
therapy 0.4672 1 8.5816 0.01262 * d
Page 598 and 599:
command that I use when I want to g
Page 600 and 601:
- 586 -
Page 602 and 603:
• Other types of correlations. In
Page 604 and 605:
measurements mean that independence
Page 606 and 607:
the list that I’ve given above is
Page 608 and 609:
So if you were forced to guess my w
Page 610 and 611:
to be able to do more than ANOVA, m
Page 612 and 613:
Fisher, R. A. (1922a). On the inter
show all

Learning Statistics with R - A tutorial for psychology students and other beginners, 2018a

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?