
Why are my degrees of freedom wrong?

Objective:  You are running a linear model, for example an ANOVA or regression, and are using the "ANOVA table" to decide which terms in the model are influencing the dependent variable.  You check the numerator and denominator degrees of freedom, as recommended, to guard against modeling errors and the use of wrong error terms.  The reported values disagree with what you expected, so now what?

Make sure your expected numbers are calculated correctly:
An example model is (last term is the residual error)
[Model 1]       y = u + block + treat + block*treat + rep(block*treat)
If numbers of levels are b=2 for blocks, t=2 for treats, and r=5 for reps, then we expect degrees of freedom (DF) to be
DF[block] = b-1 = 1
DF[treat] = t-1 = 1
DF[block*treat] = (b-1)*(t-1) = 1*1 = 1
DF[rep(block*treat)] = (r-1)*(b)*(t) = 4*2*2 = 16
Number of observations is b*t*r = 20, and DF add to 19, which is 20 minus the one DF for the intercept, as expected.
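As a quick check on the arithmetic, here is a minimal SAS sketch (the data step simulates a balanced layout with rannor; the response values are arbitrary and do not affect DF) that fits Model 1 with every term fixed, so the ANOVA table should show exactly the DF computed above.

data one;                             /* balanced layout: b=2 blocks, t=2 treats, r=5 reps */
  do block=1 to 2;
    do treat=1 to 2;
      do rep=1 to 5;
        y = 10 + rannor(111);         /* placeholder response */
        output;
      end;
    end;
  end;
run;

proc glm data=one;
  class block treat;
  model y = block treat block*treat;  /* rep(block*treat) is the residual error */
run;
/* expect block=1, treat=1, block*treat=1, Error=16, Corrected Total=19 */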
DF rules are:
1) DF for single factor terms are the number of levels minus 1
2) DF contributions from factors inside the nesting parentheses are the number of levels (not levels minus 1)
3) DF for multiple factor terms are the product of the above components.
4) Not illustrated above, but DF for any regression terms are the number of slopes included in that term.  (DF are a count of the number of parameters, which equals the number of slopes for regression terms)
5)  Total DF equals the number of observations (n).  Most models have an intercept term not listed in the ANOVA table, so "Corrected Total DF" is n-1 since the intercept is one parameter.

Make sure you are correctly matching DF for terms:
If block and rep in Model 1 are random, the ANOVA table expected from current mixed model software is
Source     NumDF     DenDF     F        P-value
Treat          1         1     x.xx     0.xxx
Only fixed effects are included, since F-tests for random effects are invalid.  NumDF is short for numerator DF, the degrees of freedom for the listed source of variation.  DenDF, or denominator DF, is the DF for the error term used to test the listed source.  Both DF are used to compute the P-value for the F-test, hence the recommendation to check that the DF are the expected (correct) values.  From above, DF[treat]=1, so the NumDF appears correct, but why is DenDF=1 and not 16?  You need to know how error terms are chosen.
The error term rule is:  the correct error term is the simplest random term that contains the source being tested.  In Model 1 we have 3 random terms (any term containing a random factor like block is always modeled as random), but only 2 of them contain treat, and block*treat is the simpler of the two compared to rep(block*treat), so it is the correct error term.  Since DF[block*treat]=1, the DenDF of 1 is as expected.
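A minimal sketch of this test, assuming the simulated data set one from the sketch above: in SAS PROC MIXED, block and block*treat go on the RANDOM statement (rep(block*treat) remains the residual), and the containment method, which applies no adjustment, should pick block*treat as the error term for treat.

proc mixed data=one;
  class block treat;
  model y = treat / ddfm=contain;    /* containment follows the classical error term rule */
  random block block*treat;          /* rep(block*treat) is left as the residual */
run;
/* expect Type 3 Tests of Fixed Effects: Treat  NumDF=1  DenDF=1 */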

The above is all correct, but my DF still disagree with the software:
1)  Check for missing observations.  The DF rules depend on complete balance, for example 5 reps in every treat and 2 treats in every block, so the number of observations should be the product of the factor levels, 2*2*5=20.  If one treatment in one block is missing one rep, then 1 DF is lost (each observation adds one DF to the model, assuming independence), and DF[rep(block*treat)] = 4+4+4+3 = 15 (5-1=4 for each block*treat cell with 5 reps, and 4-1=3 for the cell with the missing rep).  If one of the blocks is completely missing a treat, then 5 observations, and therefore 5 DF, are lost: DF[block*treat] = 0 (there are only 3 block*treat combinations, thus 2 DF, and both are assigned to the block and treat terms, leaving no DF for block*treat), and DF[rep(block*treat)] = 4+4+4 = 12, because the missing cell no longer contributes its 5-1 = 4 DF.  There are no easy DF rules when data are unbalanced, but generally you can add up the DF contributions of a term's components, as done above, to get the correct DF (the first sketch after this list shows the single missing rep case).
2)  Check for missing terms in the model.  If not all possible combinations of the factors appear as terms in the model, then the DF for each missing term are pooled into the DF for the next simplest term that contains it.  A typical example is
[Model 2]     y = u + block + treat + rep(block*treat)
Note that block*treat is a possible combination of the block and treat factors, but it is missing from the model.  DF[block*treat] will be pooled into the rep term, since that term contains block*treat, and thus you would expect DF[rep(block*treat)] = 16 + 1 = 17.  Missing terms can be dictated by the design, for example Latin squares, or might be a deliberate modeling choice to increase the power of the statistical tests (the second sketch after this list runs Model 2 in SAS).
3)  Did the software use an adjustment to the DF, such as Kenward-Roger?  Adjustments are commonly used, and are recommended (or even required) when the model involves repeated measures or multiple error terms, such as in split-plot models.  The easiest practical approach in this case is to rerun the analysis without the DF adjustment, as adjustments can drastically change DF higher or lower (even 10-fold changes) and can even lead to different error terms.  For example, Kenward-Roger will skip over any error term that has zero variance.  Another trick is to run the model with all random effects defined as fixed, thus producing a NumDF for every term in the model and allowing easy checks on the DF associated with each (the third sketch after this list illustrates these runs).  Once you verify the DF from the unadjusted analysis, or from the all-fixed analysis, it is safe to assume the software will compute DF correctly for the model as originally specified.
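For check 1, a minimal sketch assuming the simulated data set one from the first code sketch above: deleting a single observation (the particular cell and rep chosen here are arbitrary) and refitting the all-fixed model should show the residual, that is rep(block*treat), dropping from 16 to 15 DF.

data miss1;                          /* copy of the balanced data with one rep deleted */
  set one;
  if block=1 and treat=1 and rep=5 then delete;
run;

proc glm data=miss1;
  class block treat;
  model y = block treat block*treat;
run;
/* expect block=1, treat=1, block*treat=1, Error=15, Corrected Total=18 */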
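For check 2, a sketch of Model 2 on the same simulated data: with block*treat dropped from the model, its 1 DF should pool into the residual, so the DenDF for treat becomes 16 + 1 = 17.

proc mixed data=one;
  class block treat;
  model y = treat / ddfm=contain;    /* Model 2: no block*treat term anywhere in the model */
  random block;
run;
/* expect Treat  NumDF=1  DenDF=17  (the 1 DF for block*treat is pooled into the residual) */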
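For check 3, a sketch of the comparison runs described above, still assuming PROC MIXED and the simulated data; the DDFM= option selects the DF method, and the last run defines everything as fixed so each term reports an easily checked NumDF.

/* Run 1: the model as intended, with the Kenward-Roger adjustment */
proc mixed data=one;
  class block treat;
  model y = treat / ddfm=kr;
  random block block*treat;
run;

/* Run 2: the same model with no DF adjustment, to compare against the hand-counted DF */
proc mixed data=one;
  class block treat;
  model y = treat / ddfm=contain;
  random block block*treat;
run;

/* Run 3: all effects defined as fixed, so every term gets an easily checked NumDF */
proc mixed data=one;
  class block treat;
  model y = block treat block*treat;
run;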



