Saturday, 23 April 2016

Pooled standard deviation: a convenient fraudulence?

Convenient assumptions in statistics can mislead science. The myth that nearly every phenomenon in nature follows the favorite Gaussian distribution was busted by a series of eye-opening discoveries. Another assumption that statistical analyses sometimes suffer from is equal, or equivalent, variance across multiple samples. Why authors make this assumption is sometimes mysterious. I have come across articles where this assumption could plausibly alter the statistical significance of the observation. The authors calculate a pooled standard deviation over all their sample vectors and attach pooled-SD bars to their plots. Accordingly, they run a t-test under the equal-variance assumption. Here is an example:

[Figure: two plots of the same data; left with actual SD bars, right with pooled-SD bars]

The left plot shows the actual SDs and the right plot shows the pooled SD. It is the right plot that the authors present in their articles, without mentioning the actual variances or showing that those variances do not differ. Now, if we perform a two-tailed t-test for each scenario, assuming unequal and equal variances respectively, we get p-values of 0.06 and 0.04. Whichever p-value is convenient to the authors' hypothesis is then the one reported. This is a random example I could quickly work out for a sample size of 60, to show that the t-statistic can be sensitive to this assumption. Actual datasets might be even more sensitive.

Where, then, should pooled variance be used? It was primarily meant to estimate the SD when obtaining the actual SD of every sample is experimentally very expensive. For example, if y is measured as a function of x and measuring y is expensive, then instead of replicating y at every x value, one can measure multiple y values at certain equally spaced x values. The SDs of those y values can be calculated and then pooled to give a pooled SD that is applied at each x. One, of course, needs to assess whether those actual SDs were derived from equivalent variances.
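
To make the comparison concrete, here is a minimal sketch in Python (standard library only) of how the pooled SD and the two flavours of t-statistic are computed. The function names and the two small samples are my own, made up for illustration; they are not the data behind the example above. The pooled SD is taken as the square root of the degrees-of-freedom-weighted average of the sample variances, and the unequal-variance test is the usual Welch form.

```python
import math
import statistics


def pooled_sd(*samples):
    """Square root of the df-weighted average of the sample variances."""
    num = sum((len(s) - 1) * statistics.variance(s) for s in samples)
    den = sum(len(s) - 1 for s in samples)
    return math.sqrt(num / den)


def student_t(a, b):
    """Equal-variance (Student) t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    se = pooled_sd(a, b) * math.sqrt(1 / na + 1 / nb)
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, na + nb - 2


def welch_t(a, b):
    """Unequal-variance (Welch) t statistic and Welch-Satterthwaite df."""
    na, nb = len(a), len(b)
    va = statistics.variance(a) / na  # squared standard error of mean(a)
    vb = statistics.variance(b) / nb  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical samples: a is tight and larger, b is wide and smaller.
a = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.9, 5.1, 5.0, 5.0]
b = [3.0, 7.5, 5.9, 4.1, 8.2]

print("actual SDs:", statistics.stdev(a), statistics.stdev(b))
print("pooled SD :", pooled_sd(a, b))
print("Student t, df:", student_t(a, b))
print("Welch t, df  :", welch_t(a, b))
```

When the wider sample is also the smaller one, as in these made-up numbers, pooling lets the tighter sample dominate the variance estimate, so the equal-variance test reports a larger |t| than the Welch test. To turn either statistic into a p-value one needs the t-distribution; SciPy's `scipy.stats.ttest_ind` does both steps and switches between the two tests through its `equal_var` argument.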