WordSmith Tools Manual

Formulae

For computing collocation strength, we can use

•the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)

•the frequency of word 1 altogether in the corpus

•the frequency of word 2 altogether in the corpus

•the span (or horizons) within which we count two words as neighbours

•the total number of running words in our corpus: total tokens

Mutual Information

Log to base 2 of (A divided by (B times C))

where

A = joint frequency divided by total tokens

B = frequency of word 1 divided by total tokens

C = frequency of word 2 divided by total tokens
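
As an illustration, the Mutual Information formula can be sketched in Python as follows. The function and parameter names are hypothetical, chosen for this example only, and the counts are invented.

```python
import math

def mutual_information(joint, f1, f2, total_tokens):
    """Log to base 2 of (A / (B * C)), with A, B and C as defined above."""
    a = joint / total_tokens   # A: joint frequency divided by total tokens
    b = f1 / total_tokens      # B: frequency of word 1 divided by total tokens
    c = f2 / total_tokens      # C: frequency of word 2 divided by total tokens
    return math.log2(a / (b * c))

# Invented counts: the pair co-occurs 8 times, each word occurs 100 times,
# and the corpus holds 10,000 running words.
score = mutual_information(8, 100, 100, 10_000)
```

With these invented counts the ratio A / (B times C) is 8, so the score is 3 (allowing for floating-point rounding).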

MI3

Log to base 2 of ((J cubed) times E divided by B)

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2

E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)

B = (J + (total tokens-F1)) times (J + (total tokens-F2))
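
A minimal Python sketch of the MI3 formula exactly as it is worded above, reading "(J cubed) times E divided by B" as ((J cubed) times E) / B. The function name is hypothetical and the counts are invented.

```python
import math

def mi3(joint, f1, f2, total_tokens):
    """Log to base 2 of ((J cubed) * E / B), with E and B as defined above."""
    e = (joint
         + (total_tokens - f1)
         + (total_tokens - f2)
         + (total_tokens - f1 - f2))
    b = (joint + (total_tokens - f1)) * (joint + (total_tokens - f2))
    return math.log2((joint ** 3) * e / b)

# Invented counts, kept tiny so the arithmetic can be checked by hand:
# J = 2, F1 = 4, F2 = 4, total tokens = 10 gives E = 16 and B = 64.
score = mi3(2, 4, 4, 10)
```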

T Score

(J - ((F1 times F2) divided by total tokens)) divided by (square root of (J))

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2
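
The T Score formula can be sketched in Python like this. The function name is hypothetical and the counts are invented; a joint frequency of zero would divide by zero and is not handled here.

```python
import math

def t_score(joint, f1, f2, total_tokens):
    """(J - (F1 * F2 / total tokens)) divided by the square root of J."""
    expected = (f1 * f2) / total_tokens   # co-occurrences expected by chance
    return (joint - expected) / math.sqrt(joint)

# Invented counts: J = 16, F1 = 100, F2 = 100, total tokens = 1,000.
# Expected co-occurrence is 10, so the score is (16 - 10) / 4.
score = t_score(16, 100, 100, 1_000)
```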

Z Score

(J - E) divided by the square root of (E times (1-P))

where

J = joint frequency

S = collocational span

F1 = frequency of word 1

F2 = frequency of word 2

P = F2 divided by (total tokens - F1)

E = P times F1 times S
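
A short Python sketch of the Z Score formula, following the definitions of P and E given above. The function name is hypothetical and the counts are invented.

```python
import math

def z_score(joint, f1, f2, total_tokens, span):
    """(J - E) divided by the square root of (E * (1 - P))."""
    p = f2 / (total_tokens - f1)   # P: chance of meeting word 2 at any one position
    e = p * f1 * span              # E: expected joint frequency within the span
    return (joint - e) / math.sqrt(e * (1 - p))

# Invented counts: J = 130, F1 = 100, F2 = 90, total tokens = 1,000, span = 10.
# This gives P = 0.1 and E = 100.
score = z_score(130, 100, 90, 1_000, 10)
```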

Dice Coefficient

(J times 2) divided by (F1 + F2)

where

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

Ranges between 0 and 1.
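
The Dice Coefficient is simple enough to sketch in one line of Python. The function name is hypothetical and the counts are invented.

```python
def dice_coefficient(joint, f1, f2):
    """(J * 2) divided by (F1 + F2)."""
    return (joint * 2) / (f1 + f2)

# Invented counts: the two words occur 10 times each and co-occur 5 times.
score = dice_coefficient(5, 10, 10)

# If two words only ever occur together (J = F1 = F2), the result is 1.
always_together = dice_coefficient(10, 10, 10)
```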

Log Likelihood (different corpora)

where

a = frequency of term 1

b = frequency of term 2

c = total words in corpus 1

d = total words in corpus 2

computes

E1 = c*(a+b) / (c+d) and E2 = d*(a+b) / (c+d)

Log Likelihood is

2*((a* Log (a/E1)) + (b* Log (b/E2)))

(using Log to the base e)

BIC Score

is the log likelihood above minus Log(c+d), again using Log to the base e.
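
The two-corpus Log Likelihood and the BIC Score can be sketched together in Python. The function names are hypothetical and the counts are invented. Treating a zero frequency as contributing nothing to the sum is an assumption of this sketch (the usual convention); the text above does not say how zero frequencies are handled.

```python
import math

def _term(observed, expected):
    # Assumption: a zero frequency contributes 0 instead of raising an error.
    if observed == 0:
        return 0.0
    return observed * math.log(observed / expected)   # natural log (base e)

def log_likelihood(a, b, c, d):
    """2 * ((a * Log(a / E1)) + (b * Log(b / E2)))."""
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus 2
    return 2 * (_term(a, e1) + _term(b, e2))

def bic_score(a, b, c, d):
    """The log likelihood minus Log(c + d)."""
    return log_likelihood(a, b, c, d) - math.log(c + d)

# Invented counts: 20 hits in corpus 1, 10 in corpus 2, 1,000 words in each.
ll = log_likelihood(20, 10, 1_000, 1_000)
bic = bic_score(20, 10, 1_000, 1_000)
```

When the term is equally frequent in two corpora of the same size, both expected values equal the observed ones and the log likelihood is zero.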

Log Likelihood (same corpus)

uses

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

T = total word count

then computes

K11 = J

K12 = (F1 * collocation span) - J

K21 = F2 - J

K22 = T - F1 - F2 - J

as input to a routine explained at Ted Dunning's blog. The use of the collocation span is proposed by Stefan Evert.
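
The sketch below builds the four cells as described and passes them to the standard log-likelihood ratio statistic (G squared) for a 2 x 2 table. That this is the calculation performed by the routine referred to above is an assumption of this example, not something the manual states. All function names are hypothetical and the counts are invented.

```python
import math

def contingency_cells(joint, f1, f2, total, span):
    """K11, K12, K21 and K22 as defined above."""
    k11 = joint
    k12 = f1 * span - joint
    k21 = f2 - joint
    k22 = total - f1 - f2 - joint
    return k11, k12, k21, k22

def g_squared(k11, k12, k21, k22):
    """Standard G squared: 2 * sum of k * ln(k * N / (row total * column total))."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:   # a zero cell contributes nothing to the sum
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2 * total

# Invented counts: J = 20, F1 = 25, F2 = 25, T = 100, span = 1.
cells = contingency_cells(20, 25, 25, 100, 1)
score = g_squared(*cells)
```

The sketch does not guard against negative cells, which the K22 formula can produce in a very small corpus.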

Log Ratio

where

a = frequency of term 1

b = frequency of term 2

c = total words in corpus 1

d = total words in corpus 2

computes

Log ((a/c) / (b/d))

(using Log to the base 2)
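
A minimal Python sketch of Log Ratio. The function name is hypothetical and the counts are invented; a frequency of zero in either corpus is not handled.

```python
import math

def log_ratio(a, b, c, d):
    """Log to base 2 of ((a / c) / (b / d))."""
    return math.log2((a / c) / (b / d))

# Invented counts: 40 hits in corpus 1, 10 in corpus 2, 1,000 words in each.
# The term is 4 times as frequent in corpus 1, and log2 of 4 is 2.
score = log_ratio(40, 10, 1_000, 1_000)
```

Each extra point of Log Ratio represents a doubling of the relative frequency in corpus 1 compared with corpus 2.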

Conditional Probability (Durrant 2008: 84)

divides the frequency of terms 1 and 2 when together (the joint frequency) by the frequency of term1 (Conditional Probability A) or of term 2 (Conditional Probability B) and multiplies by 100 for better legibility.
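
Both variants of Conditional Probability can be sketched in Python as follows. The function names are hypothetical and the counts are invented.

```python
def conditional_probability_a(joint, f1):
    """Joint frequency divided by the frequency of term 1, times 100."""
    return 100 * joint / f1

def conditional_probability_b(joint, f2):
    """Joint frequency divided by the frequency of term 2, times 100."""
    return 100 * joint / f2

# Invented counts: term 1 occurs 20 times, term 2 occurs 50 times,
# and they occur together 5 times.
cp_a = conditional_probability_a(5, 20)
cp_b = conditional_probability_b(5, 50)
```

With these counts the two values differ (25 and 10), which shows that the measure is directional: it depends on which term is taken as the starting point.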

Delta Probability (Gries, 2013: 144)

where

j = joint frequency of term 1 with term 2

a = frequency of term 1

b = frequency of term 2

c = total words in corpus

computes (j / b) - ((a-j) / (c-a-b+j)) and multiplies the result by 100 for legibility. Very similar to Conditional Probability.
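
A Python sketch of Delta Probability, reading the formula with its second bracket closed after (c-a-b+j). The function name is hypothetical and the counts are invented.

```python
def delta_probability(j, a, b, c):
    """((j / b) - ((a - j) / (c - a - b + j))) multiplied by 100."""
    together = j / b                     # first part of the formula
    apart = (a - j) / (c - a - b + j)    # second part of the formula
    return 100 * (together - apart)

# Invented counts: j = 10, a = 20, b = 40, c = 150.
# The first part is 0.25 and the second is 10 / 100 = 0.1.
score = delta_probability(10, 20, 40, 150)
```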

Dispersion (Oakes p. 190)

where

n = number of divisions

m = mean of the frequencies over n divisions

sd = standard deviation of the frequencies

v = sd / m

r = square root of n

computes dispersion as 1 - (v / r)

(Oakes suggests the square root of n-1, but the square root of n gives slightly better results. Either way, he says the measure is designed to range between 0 and 1, but in practice a very low dispersion, such as where all the hits are in one division, can compute to less than zero. WordSmith shows results of zero or below as blanks.)
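
The dispersion calculation can be sketched in Python as below. The function name is hypothetical and the frequencies are invented. The text does not say whether the standard deviation is the population or the sample version; this sketch assumes the population version (`statistics.pstdev`), and the other choice would give different values.

```python
import math
import statistics

def dispersion(frequencies):
    """1 - (v / r), where v = sd / m and r = square root of n."""
    n = len(frequencies)                   # number of divisions
    m = statistics.mean(frequencies)       # mean frequency per division
    sd = statistics.pstdev(frequencies)    # assumption: population standard deviation
    v = sd / m
    r = math.sqrt(n)
    return 1 - (v / r)

# Invented data: 20 hits spread evenly over 4 divisions,
# and the same 20 hits all falling in the first division.
even = dispersion([5, 5, 5, 5])
clumped = dispersion([20, 0, 0, 0])
```

An even spread gives a standard deviation of zero and therefore a dispersion of 1; the clumped case gives a much lower value.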

Relative Entropy (Gries, 2010)

where

n = number of measurements

p = the probability of each measurement

computes entropy as the sum of (each p * log2 of p) with its sign reversed, so that the result is positive, and relative entropy as entropy / log2 of n
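
A minimal Python sketch of entropy and relative entropy. The function names are hypothetical and the probabilities are invented. Skipping probabilities of zero is an assumption of this sketch (the usual convention, since log2 of 0 is undefined), and a single measurement (n = 1) is not handled because log2 of 1 is zero.

```python
import math

def entropy(probabilities):
    """Sum of (p * log2 of p) over all measurements, with the sign reversed."""
    # Assumption: zero probabilities are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def relative_entropy(probabilities):
    """Entropy divided by log2 of the number of measurements."""
    n = len(probabilities)
    return entropy(probabilities) / math.log2(n)

# Invented data: four equally likely measurements,
# and four measurements where only the first ever occurs.
uniform = relative_entropy([0.25, 0.25, 0.25, 0.25])
concentrated = relative_entropy([1.0, 0.0, 0.0, 0.0])
```

A perfectly even distribution gives a relative entropy of 1, and a distribution concentrated in a single measurement gives 0.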