Thinking about keyness

Choosing a reference corpus

 

In general, the choice of reference corpus does not make a lot of difference if you use a fairly small p value (such as 0.000001). But the following analogy may help in thinking about it.

 

Different reference corpora may give different results. Suppose you have a method for comparing objects, and you take a particular apple out of your kitchen and use the method to compare it:

A) with a lot of apples in the green-grocer's shop

B) with all the fruit in the green-grocer's shop

C) with a mixture of objects (cars, carpet, notebooks, fruit, elephants etc.)

 

With A) you will get to see the apple's individual characteristics, e.g. perhaps your apple is rather sweeter than most apples. (But you won't see its "apple-ness", because your apple and all the others in your reference corpus are apples.)

With B) you will see "apple-ness" (your apple, like all apples but unlike bananas or pineapples, is rather round and has a very thin skin), but you might not see that your apple is rather sweet, and you won't get at its "fruitiness".

With C) you will get at the apple's fruity qualities: it is much sweeter and easier to bite into than cars and notebooks etc.

 

Keyness scores

Is there an important difference between a key word with a keyness of 50 and another of 500?

 

Suppose you process a text about a farmer growing 3 crops (wheat, oats and chick-peas) and suffering from 3 problems (rain, wind, drought). If each of these crops is equally important in the text, and each of the 3 problems takes one paragraph to explain, the human reader may decide that all three crops are equally key and all three problems equally key. But in English these crop terms and weather terms vary enormously in frequency (chick-peas and drought are the least frequent). WordSmith's KW analysis will necessarily give a higher keyness value to the rarer words. So it is generally unsafe to rely on the order of KWs in a KW list.
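
The effect can be illustrated with a rough sketch. It assumes keyness is computed with a log-likelihood statistic of the kind commonly used in keyword analysis; WordSmith's own computation and settings may differ. The function name log_likelihood and all the counts are invented for illustration: each crop term occurs 10 times in a 2,000-word text, but the terms differ in frequency in a 1,000,000-word reference corpus.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word (hypothetical helper, not WordSmith's code)."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

# Invented counts: every crop is mentioned equally often in the text ...
TEXT_SIZE = 2_000
FREQ_IN_TEXT = 10
# ... but the reference corpus frequencies differ (assumed values).
REF_SIZE = 1_000_000
ref_freqs = {"wheat": 60, "oats": 15, "chick-peas": 3}

keyness = {
    word: log_likelihood(FREQ_IN_TEXT, TEXT_SIZE, ref_freq, REF_SIZE)
    for word, ref_freq in ref_freqs.items()
}

# List the words from highest to lowest keyness.
for word, score in sorted(keyness.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.1f}")
```

With these assumed counts the rarest term (chick-peas) gets the highest score and the commonest (wheat) the lowest, although all three are equally frequent in the text itself. The ordering reflects the reference corpus frequencies, not any difference in importance within the text.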

 

 

 
