Authors: C.J. Burden, J. Jing and S.R. Wilson
The D2 statistic, defined as the number of matches of words of some pre-specified length k between two sequences, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as it may be susceptible to single-sequence noise. We examine the extent of the problem and the effectiveness of overcoming it by using a mean-centred version of the statistic. We conclude that the D2 statistic is a useful measure of sequence similarity, and that it can easily be extended to a mean-centred version which may perform better in some situations. Both the D2 statistic and its mean-centred version are well approximated by Gamma random variables under an i.i.d. null hypothesis, allowing accurate estimation of P-values.
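
As an illustration of the definition above, the following minimal Python sketch counts k-word matches between two sequences and fits a Gamma distribution to simulated null values by matching the first two moments. The function names (`d2`, `simulate_null`, `gamma_moment_fit`), the uniform four-letter alphabet, and the use of Monte Carlo simulation to obtain the moments are assumptions made for this example and are not taken from the paper; the mean-centred version is not shown.

```python
from collections import Counter
import random
import statistics


def d2(seq_a, seq_b, k):
    """Count matches of (overlapping) words of length k between two sequences.

    Every pair of positions (i, j) with seq_a[i:i+k] == seq_b[j:j+k]
    contributes one match, so D2 is the sum over words of the product
    of the two word counts.
    """
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(n * counts_b[word] for word, n in counts_a.items())


def simulate_null(n, m, k, trials=500, alphabet="ACGT", seed=0):
    """Draw D2 values for pairs of i.i.d. sequences of lengths n and m.

    Assumption for this sketch: letters are uniform over `alphabet`.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        a = "".join(rng.choices(alphabet, k=n))
        b = "".join(rng.choices(alphabet, k=m))
        values.append(d2(a, b, k))
    return values


def gamma_moment_fit(samples):
    """Method-of-moments Gamma fit: returns (shape, scale).

    For a Gamma variable, mean = shape * scale and variance = shape * scale**2.
    """
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)
    return mean * mean / var, var / mean


if __name__ == "__main__":
    null_values = simulate_null(n=200, m=200, k=4)
    shape, scale = gamma_moment_fit(null_values)
    print("fitted Gamma shape and scale:", shape, scale)
```

Given the fitted shape and scale, a P-value for an observed D2 value would be the upper tail probability of the corresponding Gamma distribution, which can be evaluated with any statistics library that provides the Gamma survival function.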