## Statistical analysis of linguistic data using aRgh!

When I did my PhD, I analysed all of my data in SPSS, simply because it was the only statistical package I knew about at the time. Later in my PhD, I was introduced to Goldvarb, but unfortunately that was pretty hopeless for me: I was analysing continuous data (vowel measurements), and Goldvarb could only be used to analyse categorical (either/or) data, such as the presence or absence of /r/. So Goldvarb sat on my computer, unexplored, for the duration of my PhD.

Towards the end of my PhD (i.e. as I finished most of my analysis), I started hearing about this new-kid-on-the-block piece of software called R. I loaded it up and, almost as quickly, shut it straight back down again. Although R is incredibly powerful, it is driven purely by a command-line interface. This is in stark contrast to something like SPSS, which is primarily menu-driven (you can use the command line in SPSS, but I didn’t). And if you’ve ever used a command-line interface, you’ll know just how ridiculously non-intuitive such a system is, and how hair-tearingly frustrating it is to get the software to do something simple like, oh I don’t know, open a data file… I tried it, hated it and eventually gave up on it.

Once I had finished my PhD, in the hubbub of sociolinguistic chatter, I started hearing that R was ‘the next big thing’ and that it could do far more than SPSS ever could. One of the big advantages R had over SPSS was that it could do mixed-effects regression rather than just fixed-effects regression. Importantly, mixed-effects models are a lot more powerful and accurate than fixed-effects models since (as I understand it) they weight the linguistic results against the number of speakers (Daniel Johnson has written an excellent paper on why mixed-effects models are better here).

> Or imagine that we have transcribed 1,000 tokens of words with a historical post-vocalic /r/; half the tokens come from men, and half from women. Suppose that 60% of the men’s tokens are /r/-less, compared with 40% of those from women. Given such a distribution, GoldVarb would identify gender as a highly significant factor group. And that conclusion would be perfectly justified if the data came from 40 speakers, 20 men ranging between 45% and 75% /r/-lessness, and 20 women ranging between 25% and 55%. Here, while men and women both show considerable diversity, and some women are even more /r/-less than some men, the men are more /r/-less overall. And with so many speakers in each group, the difference is very unlikely to be due to chance. But if the same 1,000 tokens had come from only eight speakers – four men with 45%, 55%, 65% and 75% /r/-lessness, and four women with 25%, 35%, 45% and 55% – we would not have sufficient evidence for a gender effect. Here, too, the average man is more /r/-less than the average woman, but the number of speakers is small enough that the difference could have arisen by chance (Johnson 2009: 364).
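Johnson’s eight-speaker scenario can be sketched in R itself, using `lme4` (the package Rbrul wraps) and simulated data. Everything below is invented for illustration – the object names `tokens`, `rless` and so on are mine, not anyone’s real dataset:

```r
# A rough sketch of Johnson's eight-speaker example: 4 men at 45/55/65/75%
# /r/-lessness, 4 women at 25/35/45/55%, 125 tokens per speaker (1,000 total).
library(lme4)

set.seed(1)
rates  <- c(0.45, 0.55, 0.65, 0.75, 0.25, 0.35, 0.45, 0.55)
tokens <- data.frame(
  speaker = factor(rep(1:8, each = 125)),
  gender  = factor(rep(c("man", "woman"), each = 500)),
  rless   = rbinom(1000, 1, rep(rates, each = 125))
)

# Fixed-effects model: treats all 1,000 tokens as independent observations,
# so the gender difference comes out looking highly significant.
fixed <- glm(rless ~ gender, data = tokens, family = binomial)

# Mixed-effects model: the by-speaker random intercept means the gender
# effect is judged against variation between speakers, not between tokens,
# so with only 8 speakers the evidence is much weaker.
mixed <- glmer(rless ~ gender + (1 | speaker),
               data = tokens, family = binomial)

summary(fixed)$coefficients
summary(mixed)$coefficients
```

The `(1 | speaker)` term is the whole trick: it is what lets the model ‘weight the results against the number of speakers’ rather than the number of tokens.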

Daniel is one of the champions of using R in sociolinguistic analysis, and I was further convinced of the usefulness of R during the Sociolinguistic Summer School in Glasgow last year. Again, though, I never really had an opportunity to use it since I was up to my eyeballs in discourse analysis, so my quantitative side kind of slipped away a little bit.

But now I’m working on a paper looking at my TH-fronting data in more detail, trying to see whether Community of Practice is a significant predictor, over and above other linguistic constraints, of whether a speaker uses [th], [f] or [h] in word-initial position, and I’ve now bitten the bullet and started trying to learn R properly. I can now (kind of) open a dataset, adjust the dataset, and do a basic mixed-effects linear regression analysis. Part of the problem, however, is that Rbrul (the marriage of Goldvarb and R, if you like) can’t deal with a categorical variable with more than two levels, and my variable has three, so now I’m trying to puzzle this out and see whether I can turn it into a continuous variable with three points.
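For anyone in the same boat: one workaround I’ve seen suggested (a sketch only, with made-up data and variable names – `th_data`, `CoP` and `speaker` are all invented here) is to avoid squeezing the three variants onto a numeric scale, and instead fit one binary mixed-effects model per non-standard variant against the standard [th]:

```r
# Sketch: separate binary mixed-effects models for a three-level variable,
# each comparing one variant against [th]. Data are simulated.
library(lme4)

set.seed(2)
th_data <- data.frame(
  speaker = factor(rep(1:10, each = 60)),
  CoP     = factor(rep(c("sporty", "schoolie"), each = 300)),
  variant = factor(sample(c("th", "f", "h"), 600, replace = TRUE,
                          prob = c(0.5, 0.35, 0.15)))
)

# [f] vs [th]: drop the [h] tokens, model the odds of fronting
f_vs_th <- glmer(I(variant == "f") ~ CoP + (1 | speaker),
                 data = subset(th_data, variant != "h"),
                 family = binomial)

# [h] vs [th]: drop the [f] tokens, model the odds of [h]
h_vs_th <- glmer(I(variant == "h") ~ CoP + (1 | speaker),
                 data = subset(th_data, variant != "f"),
                 family = binomial)
```

You lose a single unified model, but each comparison keeps the by-speaker random intercept, which is the main thing a continuous three-point recoding would quietly throw away.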

Anyway, the take-home message is that R IS HARD (for me, at least; other people have an almost supernatural affinity with it…). What resources have people found good for learning it? Or, alternatively, who wants to take my data off my hands, run R on it and let me know the results 😉

– *The Social Linguist*

FYI, SPSS can do mixed-effects regression.

Whether any package in R or elsewhere can do a mixed-effects ordinal or multinomial logistic regression (which you’ll probably need here?) I’m not sure.

DEJ

I think it’s a separate bolt-on which you need to pay for, not part of the basic SPSS set-up. I’m currently locked out of my copy of SPSS, though, so I can’t check. I think multinomial logistic regression would be the way to go, but I’ll probably take your suggestion and try to build it down to a two-way variable.
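For what it’s worth, a plain (fixed-effects) multinomial model is straightforward in R via `nnet::multinom`, which ships with the standard R distribution – the mixed-effects version is the hard part, as DEJ says. A sketch with invented data:

```r
# Fixed-effects multinomial logistic regression with nnet::multinom.
# Data and names are invented; note there is no by-speaker random term here.
library(nnet)

set.seed(3)
d <- data.frame(
  CoP     = factor(rep(c("sporty", "schoolie"), each = 150)),
  variant = factor(sample(c("th", "f", "h"), 300, replace = TRUE))
)
d$variant <- relevel(d$variant, ref = "th")   # make [th] the baseline

m <- multinom(variant ~ CoP, data = d)
summary(m)   # one row of coefficients each for [f] and [h], both vs [th]
```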

Good to hear you’ve stuck with using R; it’s fun once you get over the initial learning curve. There are various mailing lists if you’re getting stuck (http://www.r-project.org/mail.html). I don’t know shit about statistics, so I can’t be much help for you there, but I do know a bit about how to start manipulating datasets. Best of luck.

I had to use R during my Master’s degree. If you use it enough it begins to feel more and more natural, but mostly I found lunar cycles and horoscopes useful in predicting when it would work and when it wouldn’t. You can do pretty much anything you want in R – someone will have made a package for it that you can install and use. I had to do a horrible thing called an RLQ analysis. It took me four months, but I got there eventually. Look it up – great bedtime reading 😉