Only X% of words found in reference corpus

When WordSmith computes key words, it checks that most of the words in your smaller word list are also found in the reference corpus, as would normally be expected. If fewer than 50% are found, you will get this warning. That is somewhat unusual, so the warning is there to alert you that, for example, there might be something strange about one of your two texts. If you know there is nothing strange, you can ignore the message.
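The check described above can be sketched in a few lines of Python. This is only an illustration, not WordSmith's own code: the function names and the tiny word lists are made up, and it assumes the percentage is counted over distinct word forms (types) rather than running words, which this page does not specify.

```python
from collections import Counter


def coverage_percent(study_counts, reference_counts):
    """Percentage of distinct word forms in the study list that also
    occur in the reference list (assumption: types, not tokens)."""
    if not study_counts:
        return 0.0
    found = sum(1 for word in study_counts if word in reference_counts)
    return 100.0 * found / len(study_counts)


def coverage_warning(study_counts, reference_counts, threshold=50.0):
    """Return a warning string when coverage falls below the threshold,
    otherwise None. The 50% default follows the figure given on this page."""
    pct = coverage_percent(study_counts, reference_counts)
    if pct < threshold:
        return f"Only {pct:.0f}% of words found in reference corpus"
    return None


# Hypothetical toy word lists, purely for illustration.
study = Counter("the cat sat on the mat with a zyx qwv plugh".split())
reference = Counter("the dog sat on a log".split())

print(coverage_warning(study, reference))
```

With these toy lists, only 4 of the 10 distinct study words appear in the reference list, so the warning is produced.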

If you are processing clusters, however, you are much more likely to see this warning, because 3-word strings are less likely to match across the two lists than single words are.

It is up to you to decide whether there is an error in what you are doing, or whether it is OK for many of the words or clusters in your smaller word list not to be found in the reference corpus word list.

It might not be so unusual if your reference corpus is very small. But if it is indeed very small, the whole procedure is not very reliable. WordSmith simply looks at the frequency of each word form in the two lists and uses basic statistics to compute how greatly the frequencies differ. Basic statistics rely on a notion of what can be expected. If the reference corpus is extremely small, WordSmith's computation of what is to be expected isn't really very reliable.

As a dumb example, if you met three citizens of a country you had never visited, and all three looked fat, you might suppose the people of that country to be fat in general, but a sample of three is far too small to support such an expectation.

The KW procedure isn't really proof of anything, incidentally. Words don't occur in texts at all randomly, and all that ordinary basic statistics can do, in my opinion, is give us food for thought. A KW listing may well give good ideas as to what may prove interesting avenues for research.
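To make the point about small reference corpora concrete, here is a minimal sketch of one way to compare frequencies. It assumes a log-likelihood statistic, which is a common choice for keyness; this page only says "basic statistics", so treat the formula, the function name and the invented counts as illustrative assumptions.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood score for one word form, comparing its observed
    frequency in each corpus with the frequency expected if the word
    were equally common in both. (Assumed statistic, for illustration.)"""
    total = freq_study + freq_ref
    combined_size = size_study + size_ref
    expected_study = size_study * total / combined_size
    expected_ref = size_ref * total / combined_size
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score


# Invented counts: the word occurs 50 times in a 10,000-word study text.
# In both reference corpora it occurs once per 1,000 words, but one
# reference corpus is a thousand times larger than the other.
ll_large_ref = log_likelihood(50, 10_000, 1_000, 1_000_000)
ll_small_ref = log_likelihood(50, 10_000, 1, 1_000)

print(ll_large_ref, ll_small_ref)
```

The relative frequencies are identical in the two comparisons, yet the score against the tiny reference corpus comes out much lower, because a single occurrence in 1,000 words gives very little evidence about what to expect.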

 

 
