The "key words" are calculated by comparing the frequency of each word in the word-list of the text you're interested in (study corpus) with the frequency of the same word in another word-list (comparison corpus). All words which appear in the smaller list are considered, unless they are in a stop list.
If "the" occurs, say, 5% of the time in the study corpus and 6% of the time in the comparison corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc. are more frequent than they would otherwise be in your comparison corpus (unless your comparison corpus only concerns spiders!)
To compute the "keyness" of an item, the program therefore computes
•its frequency in the small word-list
•the number of running words in the small word-list
•its frequency in the comparison corpus
•the number of running words in the comparison corpus
and cross-tabulates these.
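As a rough sketch of that cross-tabulation, the two-term log-likelihood formula published on UCREL's log likelihood site can be written as below. The function name and the frequencies are invented for illustration; WordSmith's internal implementation may differ in detail.

```python
import math

def log_likelihood(freq_study, total_study, freq_ref, total_ref):
    """Log-likelihood (G2) for one word, cross-tabulating its frequency
    against the running-word totals of the two corpora (UCREL's
    two-term formulation; a sketch, not WordSmith's exact code)."""
    combined = freq_study + freq_ref
    total = total_study + total_ref
    # Expected frequencies if the word were equally likely in both corpora.
    expected_study = total_study * combined / total
    expected_ref = total_ref * combined / total
    g2 = 0.0
    if freq_study > 0:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2

# "spider" occurs 120 times in a 50,000-word study corpus but only
# 40 times in a 1,000,000-word comparison corpus:
print(round(log_likelihood(120, 50_000, 40, 1_000_000), 1))
```

The higher the score, the stronger the evidence that the word's frequency in the study corpus differs from what the comparison corpus would lead you to expect.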
A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the comparison word-list.
Unusually infrequent key-words are called "negative key-words" and appear at the very end of your listing, in a different colour. Note that negative key-words will be omitted automatically from a keywords database and a plot.
Text dispersion keyness
Egbert and Biber (2019) propose that text dispersion key words can be computed by comparing the number of texts each word appears in, in both the study corpus and the reference corpus (instead of comparing word frequencies).
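A minimal sketch of that counting step (the document-frequency count only, not Egbert and Biber's full statistic; the function name and sample texts are invented):

```python
from collections import Counter

def text_dispersion(corpus_texts):
    """Count, for each word, how many texts it appears in
    (its document frequency). Each text counts at most once
    per word, however often the word recurs within it."""
    doc_freq = Counter()
    for text in corpus_texts:
        for word in set(text.lower().split()):
            doc_freq[word] += 1
    return doc_freq

study = ["the spider has eight legs",
         "a spider web",
         "legs of the spider"]
print(text_dispersion(study)["spider"])  # -> 3: found in all three texts
```

The same count is made for the reference corpus, and the two dispersion figures are then compared in place of raw frequencies.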
Statistical tests
Three statistical tests are computed:
•Ted Dunning's Log Likelihood test, which measures keyness in terms of statistical significance and is considered more appropriate than chi-square, especially when contrasting long texts or a whole genre against your reference corpus.
•Log Ratio: Andrew Hardie's procedure, emphasizing the size of the keyness effect as opposed to its statistical significance (related to the %DIFF procedure from Costas Gabrielatos & Anna Marchi, but producing smaller, easier-to-understand numbers). A value of 2 means the item is 4 times more frequent in the small word list than in the comparison corpus list; a value of 3 means it's 8 times more frequent; a value of 4 means it's 16 times more frequent.
•BIC Score. Effectively an alternative to p scores. It uses the log likelihood score and the size of the two corpora in its formula. You can leave your p value at 0.1 if you use BIC scores to assess keyness; that will help especially where the comparison corpus is fairly small, as it will tend to bring up more negative key words reflecting the nature of the comparison corpus. Costas Gabrielatos (2018) suggests that BIC scores can be interpreted thus:
below 0 | not trustworthy
0-2 | only worth a bare mention
2-6 | positive evidence
6-10 | strong
more than 10 | very strong
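The Log Ratio and BIC calculations can be sketched as follows. The function names and figures are invented, and the `bic` formula shown (G2 minus the natural log of the combined corpus size in running words) is the commonly cited version discussed in connection with Gabrielatos (2018); treat it as an illustration rather than WordSmith's exact code.

```python
import math

def log_ratio(freq_study, total_study, freq_ref, total_ref):
    """Andrew Hardie's Log Ratio: binary log of the ratio of the
    word's relative frequencies in the two corpora.
    A value of 2 means 4x more frequent, 3 means 8x, and so on."""
    ratio = (freq_study * total_ref) / (freq_ref * total_study)
    return math.log2(ratio)

def bic(g2, total_study, total_ref):
    """BIC score from a log-likelihood (G2) value and the combined
    corpus size; sketch of the formula interpreted via the
    Gabrielatos (2018) thresholds above."""
    return g2 - math.log(total_study + total_ref)

# A word relatively 8x more frequent in the study corpus:
print(log_ratio(80, 10_000, 100, 100_000))  # -> 3.0
# Its BIC score, given a G2 of 554.6 from the log-likelihood test:
print(round(bic(554.6, 50_000, 1_000_000), 1))
```

A Log Ratio of 3.0 and a BIC score far above 10 would both point to a strongly key item on the scale above.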
See UCREL's log likelihood site and SigEff calculator for more on these.
Words get accepted as key if they pass all statistical tests. If you want to ignore the Log ratio test, set its minimum to 0 in the settings. If you want to use the BIC score or simply ignore the Log Likelihood test, set the p value to 0.1.
Words which do not occur at all in the comparison corpus are treated as if they occurred 5.0e-324 times (0.0000000 and loads more zeroes before a 5). This number is so small that it does not affect the calculation materially, while not crashing the computer's processor with a division by zero.
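This substitution can be checked directly: 5.0e-324 is the smallest positive value a 64-bit float can represent, and using it in place of zero keeps a log-based comparison finite (the figure 0.05 below is an invented relative frequency):

```python
import math

# The smallest positive 64-bit float: substituted for a frequency of zero.
tiny = 5.0e-324
print(tiny > 0)  # True: still a positive number, so logs and ratios work

# A word with relative frequency 0.05 in the study corpus and "tiny"
# in the comparison corpus gets a huge but finite log ratio:
print(math.log2(0.05) - math.log2(tiny))  # large, finite value
```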