WordSmith Tools Manual


Definitions

valid characters

Valid characters include all the characters in the language you are working with which are defined (by Microsoft) as "letters", plus any user-defined acceptable characters to be included within a word (such as the apostrophe or hyphen). That is, in English, A, a,... Z, z will be valid characters but ; or @ or _ won't. In Greek, δ will count as a valid character. In Thai, ฏ (to patak) will be a valid character.
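As a rough illustration of the idea, the sketch below classifies characters using Python's Unicode database. This is an assumption made for the example: WordSmith relies on the letter definitions supplied by Microsoft, which may differ in detail. The names `EXTRA_WORD_CHARS` and `is_valid_char` are hypothetical and not part of WordSmith.

```python
import unicodedata

# Hypothetical user setting: extra characters accepted inside a word.
EXTRA_WORD_CHARS = frozenset({"'", "-"})

def is_valid_char(ch, extra=EXTRA_WORD_CHARS):
    """Return True if ch is a letter (Unicode category L*) or a user-allowed extra."""
    return unicodedata.category(ch).startswith("L") or ch in extra

for ch in ["A", "z", ";", "@", "_", "δ", "ฏ", "'"]:
    print(repr(ch), is_valid_char(ch))
```

With these settings the letters (including δ and ฏ) and the apostrophe are accepted, while ; @ and _ are rejected.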

words

The word is defined as a sequence of valid characters with a word separator at each end.

A word can be of any length, but for words stored in a word list you may set the maximum length you prefer (up to 50 characters); any word which exceeds your limit is cut off at that point and has + tagged onto it. You can decide whether or not to include words including numbers (e.g. $35.50) in text characteristics.
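The definition can be sketched as a simple scanner. This is only an approximation under stated assumptions: letters are identified with Python's Unicode database rather than the Microsoft definitions, every non-valid character is treated as a word separator, and the option to include words containing numbers is not modelled. The function name `split_words` and its parameters are hypothetical.

```python
import unicodedata

def split_words(text, extra="'-", max_len=50):
    """Split text into runs of valid characters; mark over-long words with '+'."""
    words, current = [], []
    for ch in text + " ":  # the trailing separator flushes the last word
        if unicodedata.category(ch).startswith("L") or ch in extra:
            current.append(ch)
        elif current:
            word = "".join(current)
            if len(word) > max_len:
                # Cut the word at the limit and tag + onto it.
                word = word[:max_len] + "+"
            words.append(word)
            current = []
    return words

print(split_words("This is my book; it's fine"))
print(split_words("internationalisation", max_len=5))
```

The second call shows the length limit at work: the word is cut after five characters and + is appended.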

token and type

The term token is used to refer to running words and type to different words. So in "This is my book, it is interesting" we have 7 tokens but only 6 different types, because "is" gets repeated.
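The count can be checked with a few lines of Python. This is a minimal sketch, not WordSmith's own procedure: it assumes that words are runs of letters and that case is ignored when comparing types. The function name `count_tokens_and_types` is hypothetical.

```python
import re

def count_tokens_and_types(text):
    """Return (number of running words, number of different words)."""
    # [^\W\d_]+ matches runs of word characters that are not digits or underscores.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return len(tokens), len(set(tokens))

print(count_tokens_and_types("This is my book, it is interesting"))  # (7, 6)
```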

clusters

A cluster is a group of words which are found repeatedly together in each other's company, in sequence. The term phrase is not used here because it has technical senses in linguistics which would imply a grammatical relation between the words in it. In WordList cluster processing or Concord cluster processing there can be no certainty of this, though clusters often do match phrases or idioms. See also: general cluster information.
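A minimal sketch of the idea: count every sequence of a given number of consecutive words and keep those that recur. The cluster size, the minimum frequency and the function name `find_clusters` are assumptions chosen for the example; they do not describe the settings or internals of WordList or Concord.

```python
from collections import Counter

def find_clusters(words, size=3, min_freq=2):
    """Return word sequences of the given size that occur at least min_freq times."""
    grams = Counter(
        tuple(words[i:i + size]) for i in range(len(words) - size + 1)
    )
    return {" ".join(gram): n for gram, n in grams.items() if n >= min_freq}

words = "at the end of the day at the end of the week".split()
print(find_clusters(words, size=3))
```

Here "at the end", "the end of" and "end of the" each occur twice and are reported as clusters. "end of the" is not a grammatical phrase, which is why the term cluster is preferred.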

sentences

The end of a sentence is defined as a full stop, question mark or exclamation mark (. ? ! or their equivalents in languages such as Arabic, Chinese, etc.) immediately followed by one or more word separators and then a number, a currency symbol, or a letter in the current language which isn't lower-case. Note: languages which do not distinguish between lower-case and upper-case characters do not technically count any character as lower case or upper case. (For more discussion see Starts and Ends of Text Segments or Viewer & Aligner technical information.)
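The rule can be sketched as follows. This is an illustration under assumptions, not WordSmith's code: word separators are taken to be whitespace, only the three Western punctuation marks are checked, and numbers, currency symbols and letters are identified with Python's Unicode database. The names `ENDERS`, `starts_sentence` and `count_sentence_boundaries` are hypothetical.

```python
import unicodedata

ENDERS = ".?!"  # assumption: equivalents in other scripts are not included here

def starts_sentence(ch):
    """True for a digit, a currency symbol, or a letter that is not lower-case."""
    cat = unicodedata.category(ch)
    if cat in ("Nd", "Sc"):
        return True
    return cat.startswith("L") and not ch.islower()

def count_sentence_boundaries(text):
    """Count enders followed by one or more separators and a sentence-starting character."""
    count = 0
    for i, ch in enumerate(text):
        if ch not in ENDERS:
            continue
        j = i + 1
        while j < len(text) and text[j].isspace():
            j += 1
        # Require at least one separator and a following character.
        if j > i + 1 and j < len(text) and starts_sentence(text[j]):
            count += 1
    return count

print(count_sentence_boundaries("It cost a lot. $35 was too much! Really? yes. 2 more."))  # 3
```

The question mark before "yes" is not counted because a lower-case letter follows, and the final full stop is not counted because nothing follows it; how the end of the text is treated is outside the rule as stated above.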

paragraphs

Paragraphs are user-defined. See Starts and Ends of Text Segments for further details.

headings

Headings are also user-defined -- see Starts and Ends of Text Segments.

texts

A text in WordSmith means what most non-linguists would call a text. In a newspaper, for example, there might be 6 or 7 "texts" on each page. This also means that a text = a file on disk. If your files are not organised that way, you're better off totally ignoring the "Texts" column in any WordSmith output.

chargrams

A chargram is a sequence of N consecutive valid characters (excluding digits and punctuation) found in text, e.g. ABI, ABL, ABO, etc. In English the most frequent 3-chargrams are THE, ING, AND, ION.
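A minimal sketch of chargram counting. Two assumptions are made for the example and are not stated above: text is converted to upper case (as in the examples given), and a chargram does not run across a space, digit or punctuation mark. The function name `chargrams` is hypothetical.

```python
from collections import Counter

def chargrams(text, n=3):
    """Count sequences of n consecutive letters within each run of letters."""
    counts = Counter()
    run = []
    for ch in text.upper() + " ":  # the trailing space flushes the last run
        if ch.isalpha():
            run.append(ch)
        else:
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
            run = []
    return counts

print(chargrams("The other thing").most_common(2))
```

In this small sample THE occurs twice (once in "The" and once inside "other").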