I've come to like R quite a bit for statistical computing, even though as a language it can be rather quirky. (Case in point: the `anova()` function compares two or more fitted models using analysis of variance; if you want to fit an ANOVA model, you need the `aov()` function.) I don't use it that often, though, which is a mixed blessing. The bad news is that infrequent use makes it hard for me to remember everything I've learned about the language. The good news is that infrequent use means I'm not having to do statistical analysis very often.

I don't think I'm alone in believing that consistent coding patterns (paradigms, idioms, whatever you want to call them) are very helpful when using a language infrequently. That motivates today's post, on testing significance of a regression model. By model significance, I mean (in somewhat loose terms) testing

**H0**: the null model (no predictors other than a constant term) fits the data at least as well as our model

versus

**H1**: our model fits the data better than the null model.

When performing a standard linear regression, the usual test of model significance is an F-test. As with most (all?) statistics packages, R helpfully prints the *p*-value for this test in the summary output of the regression, so you can see whether your model is (literally) better than nothing without any extra work. To test whether a second model (call it `model2`) improves significantly on `model1`, you use the `anova()` command:

```r
anova(model1, model2)
```

which is easy enough to remember.
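As a minimal sketch of both ideas (the data and variable names here are simulated for illustration, not from the post):

```r
# Simulated data: y depends on x1 but not on x2
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

model1 <- lm(y ~ x1)       # one predictor
model2 <- lm(y ~ x1 + x2)  # adds x2

summary(model1)            # overall F-test p-value is in the last line
anova(model1, model2)      # F-test: does adding x2 improve the fit?
```

The overall F-test reported by `summary()` is the same comparison against the intercept-only model that `anova()` would give you explicitly.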

When performing a generalized linear regression, however, R does not automatically give you a model significance test. I'll focus here on a binary logit model (binary dependent variable), but I'm pretty sure the various approaches apply to other uses of the GLM, perhaps with some tweaks.

Let's say that `model1` is a binary logistic regression model I've fitted in R. The most common significance test for a binary logistic model is a chi-square test, based on the change in deviance when you add your predictors to the null model. R automatically calculates the deviance for both your model and the null model when you run the `glm()` command to fit the model. The approach to testing significance that I've seen on a number of web pages, including this one, involves calculating the *p*-value manually, using some variation of the following syntax:

```r
with(model1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```
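In context, the manual calculation looks like this (a self-contained sketch with simulated data; the variables are my own, not from the post):

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

model1 <- glm(y ~ x, family = binomial)

# Chi-square test: drop in deviance from the null model,
# on df.null - df.residual degrees of freedom
with(model1, pchisq(null.deviance - deviance,
                    df.null - df.residual, lower.tail = FALSE))
```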

That's fine (nice and compact), but the prospects of my remembering it are slender at best. Fortunately, we can use the aforementioned `anova()` command by manually fitting the null model. First rerun the logistic regression using just a constant term, and call the resulting fit `null.model`. Now compare `null.model` to `model1` using the `anova()` command, adding an argument to tell R that you want a chi-square test (by default, `anova()` on GLM fits reports deviances with no test at all):

```r
anova(null.model, model1, test = "Chisq")
```
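Put together, the whole pattern looks like this (again a sketch with simulated data of my own):

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

model1     <- glm(y ~ x, family = binomial)  # full model
null.model <- glm(y ~ 1, family = binomial)  # constant term only

# Likelihood-ratio (chi-square) test of model significance
anova(null.model, model1, test = "Chisq")
```

The resulting *p*-value is identical to the one from the manual `pchisq()` calculation; the only difference is how much you have to remember.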

You can also use the same syntax to compare two fitted logistic models for the same data, say where `model2` adds some predictors to `model1`. For me, that's a lot easier to remember than the manual approach.
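For instance, under the same kind of setup (simulated data; the second predictor `x2` is hypothetical):

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x1))

model1 <- glm(y ~ x1,      family = binomial)
model2 <- glm(y ~ x1 + x2, family = binomial)

anova(model1, model2, test = "Chisq")  # does adding x2 help?
```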

Here's some heavily annotated code (or you can download it), if you want to see an example: