WordSmith Tools Manual

Formulae

For computing collocation strength, we can use

 

the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)

the frequency of word 1 altogether in the corpus

the frequency of word 2 altogether in the corpus

the span or horizons we consider for being neighbours

the total number of running words in our corpus: total tokens

 

Mutual Information

 

Log to base 2 of (A divided by (B times C))

where

A = joint frequency divided by total tokens

B = frequency of word 1  divided by total tokens

C = frequency of word 2  divided by total tokens
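As an illustration only, the Mutual Information formula above can be sketched in Python. The function and parameter names here are hypothetical and are not WordSmith's own code.

```python
import math

def mutual_information(joint, f1, f2, total_tokens):
    """Log to base 2 of (A / (B * C)), with A, B, C as defined above."""
    a = joint / total_tokens   # A = joint frequency / total tokens
    b = f1 / total_tokens      # B = frequency of word 1 / total tokens
    c = f2 / total_tokens      # C = frequency of word 2 / total tokens
    return math.log2(a / (b * c))
```

For example, a joint frequency of 8 with word frequencies of 16 and 32 in 1,024 tokens gives a score of 4. The joint frequency must be above zero, since the logarithm of zero is undefined.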

 

MI3

 

Log to base 2 of ((J cubed) times E divided by B)

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2

E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)

B = (J + (total tokens-F1)) times (J + (total tokens-F2))
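A minimal sketch of the MI3 formula exactly as written above, reading "(J cubed) times E divided by B" as (J³ × E) / B. The function and variable names are hypothetical, chosen for this illustration.

```python
import math

def mi3(joint, f1, f2, total_tokens):
    """Log to base 2 of ((J cubed) * E / B), with E and B as defined above."""
    e = (joint
         + (total_tokens - f1)
         + (total_tokens - f2)
         + (total_tokens - f1 - f2))
    b = (joint + (total_tokens - f1)) * (joint + (total_tokens - f2))
    return math.log2((joint ** 3) * e / b)
```

With J = 2, F1 = 4, F2 = 6 and 20 total tokens, E is 42 and B is 288, so the result is the log to base 2 of 7/6.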

 

T Score

 

(J - ((F1 times F2) divided by total tokens)) divided by (square root of (J))

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2
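The T Score formula above could be written as follows. This is a sketch with hypothetical names, not the program's actual routine.

```python
import math

def t_score(joint, f1, f2, total_tokens):
    """(J - (F1 * F2 / total tokens)) / sqrt(J)."""
    expected = (f1 * f2) / total_tokens
    return (joint - expected) / math.sqrt(joint)
```

For example, J = 16, F1 = 40, F2 = 80 and 1,000 total tokens gives an expected co-occurrence of 3.2 and a T Score of 3.2. The joint frequency must be above zero to avoid dividing by zero.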

 

 

Z Score

 

(J - E) divided by the square root of (E times (1-P))

where

J = joint frequency

S = collocational span

F1 = frequency of word 1

F2 = frequency of word 2

P = F2 divided by (total tokens - F1)

E = P times F1 times S
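A sketch of the Z Score computation following the definitions above. The names are hypothetical, and the span is simply passed in as the number S without assuming how it is counted.

```python
import math

def z_score(joint, f1, f2, total_tokens, span):
    """(J - E) / sqrt(E * (1 - P)), with P and E as defined above."""
    p = f2 / (total_tokens - f1)   # P = F2 / (total tokens - F1)
    e = p * f1 * span              # E = P * F1 * S
    return (joint - e) / math.sqrt(e * (1 - p))
```

With J = 50, F1 = 100, F2 = 90, 1,000 total tokens and S = 4, P is 0.1 and E is 40, so the score is 10 divided by 6, roughly 1.67.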

 

Dice Coefficient

 

(J times 2) divided by (F1 + F2)

where

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

Ranges between 0 and 1.
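The Dice Coefficient is simple enough to show in a couple of lines. The function name is hypothetical.

```python
def dice_coefficient(joint, f1, f2):
    """(J * 2) / (F1 + F2)."""
    return (joint * 2) / (f1 + f2)
```

Two words that each occur 10 times and always co-occur (J = 10) score 1; if they co-occur 5 times they score 0.5.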

 

Log Likelihood (different corpora)

 

where

 a = frequency of term 1

 b = frequency of term 2

 c = total words in corpus 1

 d = total words in corpus 2

computes

 E1 = c*(a+b) / (c+d) and E2 = d*(a+b) / (c+d)

 Log Likelihood is

 2*((a* Log (a/E1)) + (b* Log (b/E2)))

(using Log to the base e)
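The computation above can be sketched as below. The names are hypothetical. Treating a zero frequency as contributing nothing to the sum is an assumption made here (a common convention), not something stated above.

```python
import math

def log_likelihood(a, b, c, d):
    """2 * ((a * ln(a / E1)) + (b * ln(b / E2))), with E1 and E2 as above."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    def term(observed, expected):
        # Assumption: a zero frequency contributes 0 rather than raising an error.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(a, e1) + term(b, e2))
```

When the two frequencies are in proportion to the corpus sizes (say 15 and 15 in two corpora of 1,000 words each), the result is zero; the further they diverge, the higher the score.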

 

BIC Score

is the log likelihood above minus Log(c+d).
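As a sketch, the BIC Score can be computed from an already calculated log likelihood value. The names are hypothetical. The base of Log(c+d) is not stated above; natural log is assumed here to match the log likelihood.

```python
import math

def bic_score(log_likelihood_value, c, d):
    """Log likelihood minus Log(c + d); natural log is an assumption."""
    return log_likelihood_value - math.log(c + d)
```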

 

Log Likelihood (same corpus)

uses

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

T = total word count

 

then computes K11 = J; K12 = (F1 * collocation span) - J; K21 = F2 - J; K22 = T - F1 - F2 - J

as input to a routine explained at Ted Dunning's blog. The use of the collocation span is proposed by Stefan Evert.
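The sketch below builds the four cells as described and feeds them to an entropy-based log-likelihood ratio for a 2x2 table, which is one widely used formulation of Dunning's test. Whether WordSmith's routine matches it line for line is an assumption, and all names are hypothetical.

```python
import math

def x_log_x(x):
    # By convention 0 * ln(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalised entropy: N ln N minus the sum of k ln k.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values from rounding.
    return max(0.0, 2.0 * (row + col - mat))

def same_corpus_cells(joint, f1, f2, total, span):
    """K11, K12, K21, K22 as given in the text above."""
    return (joint, f1 * span - joint, f2 - joint, total - f1 - f2 - joint)
```

For example, `llr_2x2(*same_corpus_cells(5, 20, 30, 1000, 4))` scores the cells (5, 75, 25, 945). A table whose cells are all equal scores zero, as there is no association to detect.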

 

Log Ratio

where

 a = frequency of term 1

 b = frequency of term 2

 c = total words in corpus 1

 d = total words in corpus 2

computes

 Log ((a/c) / (b/d))

(using Log to the base 2)
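A short sketch of the Log Ratio computation, with hypothetical names:

```python
import math

def log_ratio(a, b, c, d):
    """Log to base 2 of ((a / c) / (b / d))."""
    return math.log2((a / c) / (b / d))
```

With frequencies of 40 and 10 in two corpora of 1,000 words each, the relative frequency is four times higher in the first, giving a Log Ratio of 2. Both frequencies must be above zero for the formula as written to be computable.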

 

Dispersion (Oakes p. 190)

where

n = number of divisions

m = mean of the frequencies over n divisions

sd = standard deviation of the frequencies

v = sd / m

r = square root of n

computes dispersion as 1 - (v / r)

(Oakes suggests the square root of n-1, but the square root of n gives slightly better results. Either way, he says the measure is designed to range between 1 and 0; in practice, a very low dispersion, such as where all the hits fall in one division, can compute to less than zero. WordSmith shows results of zero or below as blanks.)
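The dispersion calculation can be sketched as follows, taking a list of the frequencies found in each division. The names are hypothetical. The text does not say whether the standard deviation is the population or the sample version; the population version is assumed here, and the exact values depend on that choice.

```python
import math
import statistics

def dispersion(frequencies):
    """1 - (v / r), with v = sd / m and r = square root of n."""
    n = len(frequencies)                  # number of divisions
    m = statistics.mean(frequencies)      # mean frequency per division
    sd = statistics.pstdev(frequencies)   # assumption: population standard deviation
    v = sd / m
    r = math.sqrt(n)
    return 1 - (v / r)
```

A word spread perfectly evenly, such as 5 hits in each of 4 divisions, scores 1. The list must contain at least one hit, otherwise the mean is zero and v cannot be computed.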

 

See also: link 1 from Lancaster University, link 2 from Lancaster, Mutual Information, plot dispersion