My very first R analysis

I’m planning on being an R expert someday. For now, I’m a very lost newbie taking stabs in the dark. Fortunately, I have some very clever people at arm’s length to badger for advice whenever I get stuck (big ups to Caitlyn for indulging all of my questions). I thought I would share a bit of my learning experience with conducting an ANOVA with one between-subjects factor (Condition), and one within-subjects factor (Block). I think this will be particularly useful for fellow undergraduates in our lab who are just starting to build their data-analysis skill set, but maybe others will find it useful too. I want to show you what didn’t work, in case you encounter the same thing as I did, and what did work.

My current levels of expertise: I’m very comfortable in Matlab, but I’m still learning how to use R properly. I’m not yet a statistics expert, either. Only the basic concepts I still encounter regularly as a research assistant (t-tests, simple correlations, power calculations) have stuck with me since my two statistics classes in 2003 and 2009. I’m reviewing my old textbook in preparation for graduate school, brushing up on the more advanced basics as I work through R. If it seems to you that something should be obvious to me as we work through this example, it probably should be. I’m learning about ANOVAs all over again at the same time I’m learning about R.

The dataset in question: People sorted images of cells into four categories. I had two conditions: one where each category group appeared at the same rate (the “equal” condition), and one where category groups appeared at different rates (the “unequal” condition). Each block of the experiment is 60 trials, for a total of 240 trials in 4 blocks. The score I calculate is the difference of a subject’s proportion of trials that begin with fixations to one feature and proportion of trials that begin with a fixation to another feature.

Here’s what I did, where I got confused, and how I think I may have gone right. I’ll update this post later once I know more about R and can fill in information about why things work they way they do.

First, I downloaded RStudio.

You don’t have to download RStudio, but I find it very useful to have all of R’s windows in a single window with four panes. It makes the R environment look a little more like Matlab. Importantly, it ensures that the help screen is always right beside my console, and I need that. I find it especially useful because I’m using a laptop and I find it cumbersome to be constantly searching around for the help window.

Next, I got my data ready to import into R.

To do an ANOVA in R, you need to format your data so that each row corresponds to an observation. Using Matlab, I created a matrix variable set up like that. I copied and pasted it into a text editor, where it shows up already in tab-delimited format. I gave it headings and saved it as differenceScoresForR5to1.txt. It looks like this:

condition    subject    block    score
0    1    1    0.142857142857143
0    1    2    0.195121951219512
0    1    3    0.250000000000000
0    1    4    0.315789473684211
0    2    1    -0.181818181818182
0    2    2    0.342105263157895
0    2    3    0.377777777777778
0    2    4    0.514285714285714
0    3    1    0.547169811320755
0    3    2    0.588235294117647
1    89    3    0.863636363636364
1    89    4    0.820000000000000


Condition 0 = equal, 1 = unequal. There are four blocks for each subject, and one mean for each subject for each block. I have 44 people in the equal condition, and 45 people in the unequal condition.

I used SPSS to get some results.

Because I don’t know what I’m doing in R, I thought I ought to have something to compare my output to. I used SPSS to get the results I would be trying to replicate in R.

I loaded my data into R.

I loaded the data into R using the command:

 > data5to1 <- read.table('differenceScoresForR5to1.txt', h=TRUE)


I used the read.table command primarily because it’s what I’ve seen other people in my lab do. h=TRUE (alternatively and identically, header=TRUE) tells R that I have headers in this file. This formats my data as a dataframe. Don’t believe me? Check it out:

 [1] TRUE


Using Dr. Kabacoff’s Quick-R guide for ANOVAs as a starting point, I did the following call:

 > summary(aov(score~(block*condition)+Error(subject/block)+condition, data=data5to1))


And got the following results:

 Error: subject
            Df Sum Sq Mean Sq
 condition   1 2.8431  2.8431

 Error: subject:block
       Df Sum Sq Mean Sq
 block  1 4.8439  4.8439

 Error: Within
                  Df  Sum Sq Mean Sq F value   Pr(>F)
 block             1   3.616  3.6157 10.9218 0.001049 **
 condition         1   1.391  1.3911  4.2021 0.041120 *
 block:condition   1   0.836  0.8361  2.5256 0.112912
 Residuals       350 115.868  0.3311
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


No need to check this output against my SPSS output – the first thing that stuck out was the degrees of freedom indicated for block. I have four blocks, so this ought to be 3, not 1. Now, it could be that I’m using the wrong equation for my ANOVA – that’s entirely likely. But it seemed to me that R doesn’t know about my factors.

I told R about my factors.

I found a useful tutorial over at UCLA’s R Learning Modules about factor variables. Before this, though, I was using the factor function like this:

> badConditionFactor = factor(data5to1$condition, levels = 2)
> badBlocksFactor = factor(data5to1$block, levels = 4)


Conducting an analysis with this mistake yields the following error:

 > summary(aov(data5to1$score~(badBlocksFactor*badConditionFactor)+Error(data5to1$subject/badBlocksFactor)+badConditionFactor))
 Error in `contrasts<-`(`*tmp*`, value = "contr.helmert") :
  contrasts can be applied only to factors with 2 or more levels


It turns out this is NOT how to think about the levels input argument when using factor. You see, badConditionFactor is a bunch of NAs:

 > badConditionFactor
 [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
 [17] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
 [353] <NA> <NA> <NA> <NA>
 Levels: 2


badBlocksFactor is a bunch of NAs and a 4:

> badBlocksFactor
 [1] <NA> <NA> <NA> 4    <NA> <NA> <NA> 4    <NA> <NA> <NA> 4    <NA> <NA> <NA> 4  
[353] <NA> <NA> <NA> 4
Levels: 4


That’s not how levels works. You’d know that if, unlike me, you read the help file carefully instead of assuming what kind of input it must be looking for. Here’s what I did after more careful reading:

 > conditionFactor = factor(data5to1$condition, labels = c('equal', 'unequal'))
 > conditionFactor
 [1] equal   equal   equal   equal   equal   equal   equal   equal   equal   equal 
 [171] equal   equal   equal   equal   equal   equal   unequal unequal unequal unequal
 [351] unequal unequal unequal unequal unequal unequal
 Levels: equal unequal


In fact, I believe simply “conditionFactor = factor(data5to1$condition)” will sort out that this is a two-factor variable. I just like the labels because it helps me understand what’s going on. For example, after calling “blocksFactor = factor(data5to1$block, labels = c(‘B1’, ‘B2’, ‘B3’, ‘B4’))” I can do:

> table(conditionFactor, blocksFactor)

 conditionFactor B1 B2 B3 B4
         equal   44 44 44 44
         unequal 45 45 45 45


I can see that there is the correct number of subjects in each condition in each block.

After also doing “subjectFactor = factor(data5to1$subject)”, I did another ANOVA with the newly defined factor variables.

> summary(aov(data5to1$score~(blocksFactor*conditionFactor)+Error(subjectFactor/blocksFactor)+conditionFactor))

 Error: subjectFactor
                 Df Sum Sq Mean Sq F value  Pr(>F) 
 conditionFactor  1  4.202  4.2024   3.904 0.05134 .
 Residuals       87 93.649  1.0764                 
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

 Error: subjectFactor:blocksFactor
                               Df  Sum Sq Mean Sq F value    Pr(>F)   
 blocksFactor                   3  8.2864 2.76214 32.4364 < 2.2e-16 ***
 blocksFactor:conditionFactor   3  1.0350 0.34498  4.0512  0.007727 **
 Residuals                    261 22.2256 0.08516                     
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


These results generally match my SPSS output, but I still find differences in some places – for example, my R output says for condition, df= 1, Mean Square = 4.2024, F = 3.904, p = 0.05134. My SPSS output suggests Mean Square is 1.051, but the df, F, and p values are identical. The reason for these differences requires a bit more learning about ANOVAs, which is my next step. My other next step is to talk to Bill (our resident Everything expert) about the steps I took in R so I can learn more about what’s going on and begin to feel confident about using R without checking my output against the output of other statistical packages (for example, is there a way to read in data and also tell R about factors in one step?).

Thanks for bearing with me, folks! Sorry if I’ve made expert users (of both R and statistics) cringe. My goal for this post is to help other people in our lab new to this world feel comfortable exploring these things.


I have a bit more information for you in this more recent post.