Formulae


For computing collocation strength, we can use:

 

the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
the frequency of word 1 altogether in the corpus
the frequency of word 2 altogether in the corpus
the span, or horizon, within which words count as neighbours
the total number of running words in our corpus: total tokens
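
To make "joint frequency" and "span" concrete, here is a minimal Python sketch. It assumes a symmetric window of `span` tokens on each side of word 1 and counts every occurrence of word 2 inside it. The function name `joint_frequency`, the tokenised input and the counting convention are all illustrative assumptions, not a description of how any particular tool counts.

```python
def joint_frequency(tokens, word1, word2, span=5):
    """Count occurrences of word2 within `span` tokens either side of each word1.

    Assumption: a symmetric window, and every hit inside the window counts.
    """
    count = 0
    for i, tok in enumerate(tokens):
        if tok != word1:
            continue
        lo = max(0, i - span)                # left edge of the window
        hi = min(len(tokens), i + span + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i and tokens[j] == word2:
                count += 1
    return count


tokens = "strong tea and strong coffee but weak tea".split()
narrow = joint_frequency(tokens, "strong", "tea", span=1)
wide = joint_frequency(tokens, "strong", "tea", span=5)
print(narrow, wide)
```

A wider span finds more neighbours, which is why the span has to be fixed before joint frequencies can be compared.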

 

 

Mutual Information

 

Log to base 2 of (A divided by (B times C))

where

A = joint frequency divided by total tokens

B = frequency of word 1  divided by total tokens

C = frequency of word 2  divided by total tokens
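
The Mutual Information formula above can be sketched in Python as follows. The function and parameter names are illustrative assumptions, and the sketch assumes all frequencies are greater than zero (the logarithm is undefined otherwise).

```python
import math


def mutual_information(joint, f1, f2, total_tokens):
    """Log to base 2 of (A / (B * C)), with A, B, C as defined above."""
    a = joint / total_tokens  # A = joint frequency / total tokens
    b = f1 / total_tokens     # B = frequency of word 1 / total tokens
    c = f2 / total_tokens     # C = frequency of word 2 / total tokens
    return math.log2(a / (b * c))


# Hypothetical counts: the pair co-occurs four times as often as
# chance would predict, so the score is close to 2.
print(mutual_information(joint=8, f1=100, f2=200, total_tokens=10_000))
```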

 

MI3

 

Log to base 2 of ((J cubed) times E divided by B)

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2

E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)

B = (J + (total tokens-F1)) times (J + (total tokens-F2))
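
A literal Python transcription of the MI3 formula as it is stated above, using the E and B of this section. The function and parameter names are illustrative assumptions, and the sketch assumes a joint frequency greater than zero.

```python
import math


def mi3(joint, f1, f2, total_tokens):
    """Log to base 2 of ((J cubed) * E / B), with E and B as defined above."""
    e = (joint
         + (total_tokens - f1)
         + (total_tokens - f2)
         + (total_tokens - f1 - f2))
    b = (joint + (total_tokens - f1)) * (joint + (total_tokens - f2))
    return math.log2((joint ** 3) * e / b)


# Hypothetical counts, for illustration only.
print(mi3(joint=10, f1=100, f2=200, total_tokens=10_000))
```

Because J is cubed, a pair with a higher joint frequency scores higher here when the other counts stay the same.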

 

T Score

 

(J - (X divided by total tokens)) divided by (square root of (J))

where

J = joint frequency

F1 = frequency of word 1

F2 = frequency of word 2

X = F1 times F2
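
A minimal Python sketch of the T Score, assuming the usual form of the statistic: J minus the expected joint frequency (X divided by total tokens), divided by the square root of J. The function and parameter names are illustrative assumptions, and the joint frequency must be greater than zero.

```python
import math


def t_score(joint, f1, f2, total_tokens):
    """(J - X / total tokens) / sqrt(J), where X = F1 * F2."""
    x = f1 * f2
    expected = x / total_tokens  # joint frequency expected by chance
    return (joint - expected) / math.sqrt(joint)


# Hypothetical counts: 16 observed against 2 expected.
print(t_score(joint=16, f1=100, f2=200, total_tokens=10_000))  # 3.5
```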

 

Z Score

 

(J - E) divided by the square root of (E times (1-P))

where

J = joint frequency

S = collocational span

F1 = frequency of word 1

F2 = frequency of word 2

P = F2 divided by (total tokens - F1)

E = P times F1 times S
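
The Z Score definitions above translate into Python roughly as follows. The function and parameter names are illustrative assumptions. Note that the span S enters the expected value E directly, so the score depends on the window chosen.

```python
import math


def z_score(joint, f1, f2, total_tokens, span):
    """(J - E) / sqrt(E * (1 - P)), with P and E as defined above."""
    p = f2 / (total_tokens - f1)  # P = F2 / (total tokens - F1)
    e = p * f1 * span             # E = P * F1 * S
    return (joint - e) / math.sqrt(e * (1 - p))


# Hypothetical counts: P = 0.01 and E = 10, against 30 observed.
print(z_score(joint=30, f1=100, f2=99, total_tokens=10_000, span=10))
```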

 

Dice Coefficient

 

(J times 2) divided by (F1 + F2)

where

J = joint frequency

F1 = frequency of word 1 or corpus 1 word count

F2 = frequency of word 2 or corpus 2 word count

Ranges between 0 and 1.
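
A short Python sketch of the Dice Coefficient, with illustrative names. The result stays between 0 and 1 as long as the joint frequency does not exceed either individual frequency.

```python
def dice(joint, f1, f2):
    """(J * 2) / (F1 + F2)."""
    return (2 * joint) / (f1 + f2)


# Hypothetical counts.
print(dice(joint=10, f1=20, f2=30))  # 0.4
```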

 

Log Likelihood

Based on Oakes, pp. 170-2.

2 times (

 a Ln a + b Ln b + c Ln c + d Ln d

 - (a+b) Ln (a+b)

 - (a+c) Ln (a+c)

 - (b+d) Ln (b+d)

 - (c+d) Ln (c+d)

 + (a+b+c+d) Ln (a+b+c+d)

 )

where

a = joint frequency

b = frequency of word 1

c = frequency of word 2

d = frequency of pairs involving neither w1 nor w2

and "Ln" means Natural Logarithm
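
The Log Likelihood expression above can be transcribed into Python term by term. The names `xlnx` and `log_likelihood` are illustrative assumptions. Treating "0 Ln 0" as 0 is also an assumption, made so that a zero count does not raise an error. The function takes a, b, c and d exactly as they are defined above.

```python
import math


def xlnx(x):
    """x * Ln(x), with the assumed convention that 0 * Ln(0) is 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(a, b, c, d):
    """2 * (sum of x Ln x terms), in the same order as the formula above."""
    return 2 * (
        xlnx(a) + xlnx(b) + xlnx(c) + xlnx(d)
        - xlnx(a + b)
        - xlnx(a + c)
        - xlnx(b + d)
        - xlnx(c + d)
        + xlnx(a + b + c + d)
    )


# Hypothetical values, for illustration only.
print(log_likelihood(a=10, b=0, c=0, d=10))
```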

 

See also: this link from Lancaster University, Mutual Information

Page url: http://www.lexically.net/downloads/version5/HTML/?formulae.htm