
Why are my degrees of freedom wrong?

Objective:  You are running a linear model, for example an ANOVA or regression, and are using the "ANOVA table" to decide which terms in the model influence the dependent variable.  You check the numerator and denominator degrees of freedom, as recommended, to guard against modeling errors and use of the wrong error terms.  The reported values disagree with what you expected, so now what?

Make sure your expected numbers are calculated correctly:
An example model is (last term is the residual error)
[Model 1]       y = u + block + treat + block*treat + rep(block*treat)
If numbers of levels are b=2 for blocks, t=2 for treats, and r=5 for reps, then we expect degrees of freedom (DF) to be
DF[block] = b-1 = 1
DF[treat] = t-1 = 1
DF[block*treat] = (b-1)*(t-1) = 1*1 = 1
DF[rep(block*treat)] = (r-1)*(b)*(t) = 4*2*2 = 16
Number of observations is b*t*r = 20, and the DF sum to 19, which is 20 minus the one DF for the intercept, as expected.
DF rules are:
1) DF for single factor terms are the number of levels minus 1.
2) Factors inside the nesting parentheses contribute their full number of levels (no minus 1); in rep(block*treat) above, block and treat contribute b and t.
3) DF for multiple factor terms are the product of the components above.
4) Not illustrated above, but DF for any regression term are the number of slopes included in that term.  (DF are a count of the number of parameters, which equals the number of slopes for regression terms.)
5) Total DF equal the number of observations (n).  Most models have an intercept term not listed in the ANOVA table, so "Corrected Total DF" is n-1, since the intercept is one parameter.
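As a quick software check of these rules, one option is to fit Model 1 with every term fixed and read off the DF column.  A minimal SAS sketch, assuming the data sit in a dataset named one with variables y, block, treat, and rep (all of these names are assumptions, not from the original analysis):

proc glm data=one;
  class block treat rep;
  /* all terms fixed; the model is saturated, so error DF = 0 and no
     F-tests appear, but the DF column should match the hand counts */
  model y = block treat block*treat rep(block*treat);
run;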

Make sure you are correctly matching DF for terms:
If block and rep in Model 1 are random, the ANOVA table expected from current mixed model software is
Source    NumDF    DenDF    F       P-value
treat     1        1        x.xx    0.xxx
Only fixed effects are included, since F-tests for random effects are invalid.  NumDF is short for numerator DF, the degrees of freedom for the listed source of variation.  DenDF, or denominator DF, is the DF for the error term used to test the listed source.  Both DF are used to compute the P-value for the F-test, hence the recommendation to check that the DF are the expected (correct) values.  From above, DF[treat]=1, so the NumDF is as expected, but why is DenDF=1 and not 16?  You need to know how error terms are chosen.
The error term rule is:  the correct error term is the simplest random term that contains the source being tested.  In Model 1 we have 3 random terms (any term containing a random factor like block is itself modeled as random), and 2 of them contain treat; block*treat is the simpler of the two compared to rep(block*treat), thus the correct error term.  Since DF[block*treat]=1, the DenDF is as expected.
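A minimal PROC MIXED sketch of this test, reusing the assumed dataset one from above.  With a RANDOM statement and no DDFM= option, PROC MIXED defaults to the containment method, which applies exactly the error term rule described here:

proc mixed data=one;
  class block treat rep;
  model y = treat;            /* fixed effects only: NumDF[treat] = 1        */
  random block block*treat;   /* rep(block*treat) is left as the residual    */
run;                          /* containment picks block*treat, so DenDF = 1 */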

The above is all correct, but my DF still disagree with the software:
1)  Check for missing observations.  The DF rules depend on complete balance, for example 5 reps in every treat, and 2 treats in every block.  The number of observations should then be the product of the factor levels, 2*2*5=20.  If one treatment in one block is missing one rep, then 1 DF is lost (each observation adds one DF to the model, assuming independence), and DF[rep(block*treat)]=4+4+4+3=15 (5-1=4 for each block*treat cell with 5 reps, and 4-1=3 for the cell with the missing rep).  If one of the blocks is completely missing a treat, then block*treat loses its 1 DF, giving DF[block*treat]=0 (there are 3 block*treat combinations, thus 2 DF, and both are assigned to the block and treat terms, leaving no DF for block*treat).  That would also mean DF[rep(block*treat)]=4+4+4=12, because the missing cell no longer contributes its 5-1=4 rep DF.  There are no easy DF rules when data are unbalanced, but generally you can add DF from components of the term to get the correct DF.
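To reproduce the one-missing-rep case in software, one can delete a single observation and refit the all-fixed model from earlier; a sketch, where the particular cell deleted is an arbitrary choice:

data two;
  set one;
  /* drop one rep from one block*treat cell (arbitrary choice) */
  if block=1 and treat=1 and rep=5 then delete;
run;
proc glm data=two;
  class block treat rep;
  model y = block treat block*treat rep(block*treat);  /* DF[rep(block*treat)] drops 16 -> 15 */
run;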
2)  Check for missing terms in the model.  If a possible combination of factors is not a term in the model, then the DF for the missing term are pooled into the DF for the next simplest term containing it.  A typical example is
[Model 2]     y = u + block + treat + rep(block*treat)
Note the block*treat term is a possible combination of the block and treat factors, but is missing from the model.  DF[block*treat] will be pooled into the rep term, since rep(block*treat) contains block*treat, and thus you would expect DF[rep(block*treat)] = 16 + 1 = 17.  Missing terms can be dictated by the design, for example Latin squares, or might be a deliberate modeling choice to increase the power of statistical tests.
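A sketch of Model 2 with the same assumed dataset; because block*treat is omitted and no random term contains treat, the residual becomes the error term for treat, and the pooled DF should appear as DenDF:

proc mixed data=one;
  class block treat rep;
  model y = treat;   /* NumDF[treat] = 1                                      */
  random block;      /* no block*treat term: its 1 DF pools into the residual */
run;                 /* DenDF for treat = 16 + 1 = 17                         */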
3)  Did the software use an adjustment to DF, such as Kenward-Roger?  Adjustments are commonly used, and are recommended (or better yet, required) when the model involves repeated measures or multiple error terms, as in split-plot models.  The easiest practical approach in this case is to rerun the analysis without the DF adjustment, as adjustments can change DF drastically higher or lower (even 10-fold changes), and can even choose different error terms.  For example, Kenward-Roger will skip over any error term that has zero estimated variance.  Another trick is to run the model with all random effects defined as fixed, thus producing NumDF for all terms in the model, allowing an easy check on the DF associated with each.  Once you verify the DF from the unadjusted analysis, or the all-fixed analysis, it is safe to assume the software will compute DF correctly for the model as originally specified.
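In PROC MIXED the adjustment is selected with the DDFM= option on the MODEL statement, so the rerun-without-adjustment check amounts to switching that option (the all-fixed check is the PROC GLM run sketched earlier):

proc mixed data=one;
  class block treat rep;
  model y = treat / ddfm=kr;       /* Kenward-Roger adjusted DenDF   */
  random block block*treat;
run;
proc mixed data=one;
  class block treat rep;
  model y = treat / ddfm=contain;  /* unadjusted (containment) DenDF */
  random block block*treat;
run;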



