Mike Scott's Web

Frequently Asked Questions

Can WordSmith handle Language X (e.g. Igbo)?

There are 3 issues here:
a) processing a string of text and for example finding word breaks, or seeking concordance lines
b) showing results on a screen and in printed form
c) sorting lists of words in alphabetical order.


processing a string of text

In a PC, each symbol is stored only as a number, such as 5,793 or 65, so in principle any writing system can be accommodated. WordSmith recognises word-breaks by assuming there is a space or similar separator between words. This does not apply to all languages: in particular, Chinese and Japanese do not usually mark word-breaks with spaces (code 32) as English does. That means that concordancing should work OK (since we are simply searching for a string of characters), but word-listing is not possible (though character-listing is).
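The difference between word-listing and character-listing can be sketched in a few lines of Python. This is only an illustration of the principle, not WordSmith's own code; the function names and the sample sentences are invented for the example.

```python
from collections import Counter

def word_list(text):
    # Treat any run of whitespace (such as the space, code 32) as a word-break.
    return Counter(text.split())

def char_list(text):
    # Count every character except whitespace.
    return Counter(ch for ch in text if not ch.isspace())

english = "the cat saw the dog"
chinese = "我爱北京我爱上海"  # written without spaces between words

english_words = word_list(english)   # finds separate words
chinese_words = word_list(chinese)   # the whole string counts as one "word"
chinese_chars = char_list(chinese)   # character-listing still works

# Concordancing is a plain string search, so it does not need word-breaks.
found = "北京" in chinese
```

Here the English sentence splits into words as expected, while the Chinese string comes back as a single item in the word list, although its characters can still be counted and searched for.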

showing results

You need a font which translates the numbers into symbols in pixels. In WordSmith 3 (WS3), the current version, users can define their own alphabet and alphabetical order, but there are problems in getting a Windows PC to show things correctly.
WS3 uses alphabets which can be represented in one byte (a number between 0 and 255), a system which was usual in computers until recently. In practice, a one-byte representation system needs a "codepage" (a table of 256 characters) which contains the symbols you need. I do not believe there is one suitable for Igbo but I am not sure: presumably computational linguists in Nigeria will know. The Windows Eastern Europe codepage covers some of the characters needed but probably not all.

WS4 (the one I'm working on now) can still use the old one-byte system, but also 2-byte Unicode. Unicode contains a very large number of symbols, many thousands of them. For Unicode you will need a Unicode font. Look up "Unicode" in a search engine if this is new to you.
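To see what the one-byte limit means in practice, here is a small Python sketch that checks whether a character exists in a given codepage. The helper name is invented for the example, and I am assuming Python's "cp1250" and "cp1252" codecs as stand-ins for the Windows Eastern Europe and Western Europe codepages.

```python
def fits_in_codepage(text, codepage):
    # True if every character in text has a slot in the one-byte codepage.
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

dotted_o = "\u1ecd"       # o with a dot below, as used in Igbo
double_acute_o = "\u0151"  # o with double acute, as used in Hungarian

letter_a_code = ord("A")               # each symbol is only a number
dotted_o_code = ord(dotted_o)          # too large for a one-byte table

in_eastern_europe = fits_in_codepage(dotted_o, "cp1250")
hungarian_ok = fits_in_codepage(double_acute_o, "cp1250")
in_western_europe = fits_in_codepage(dotted_o, "cp1252")

# In a 2-byte Unicode encoding the same character is stored without trouble.
two_byte_form = dotted_o.encode("utf-16-le")
```

The sub-dotted o has a number well above 255, so neither of these two codepages has room for it, whereas the 2-byte encoding stores it in exactly two bytes.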

sorting words

For this, you need a principle for saying whether (for example) an Igbo sub-dotted o is to belong with the other o's or whether it should come after z. WS4 now uses the sorting routines provided by Microsoft instead of my own. The user has to specify which language they are using, and Microsoft's sort routines for English, Portuguese, Japanese, French, Russian etc. do the sorting. I do not think Igbo is included in the list, but this would only affect the sorting and might not be important in practice. The best thing to do is to tell WS4 that the language you are processing is one which does sort acceptably for your data: this might be Hungarian or Spanish; you will have to experiment to see.
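The two sorting principles can be compared with a short Python sketch. This is not how WS4 or Microsoft's routines work internally; it is a plain illustration using Python's standard unicodedata module, and the key function and sample strings are invented for the example.

```python
import unicodedata

def fold_key(word):
    # Decompose each letter into base letter plus combining marks,
    # then drop the marks so a sub-dotted o sorts with the other o's.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word is a tie-breaker when two words share a base form.
    return (base, word)

# Sample strings; the second begins with a sub-dotted o (U+1ECD).
words = ["oke", "\u1ecdk\u1ee5", "zuru", "ola"]

# Sorting by raw character number puts the dotted o after z.
by_number = sorted(words)

# Sorting by the folded key puts it among the other o's.
folded = sorted(words, key=fold_key)
```

With the raw numbers the dotted word lands at the very end of the list; with the folded key it sits between "oke" and "ola". Which of the two is right for a given language is exactly the decision described above.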

Does WordSmith tag texts?

No. You have to tag your own texts manually or use a tagger to tag them automatically.

Does WordSmith come with a corpus?

No. I cannot legally supply you with a body of texts. But you can easily build up your own using Internet resources. There are lots of corpora: some are freely accessible, others can be purchased cheaply, and others are extremely expensive. Try a Google search on "text corpus". Or visit newspaper web sites.

My registration code didn't work...

The beta version of WS4 uses the name "temporary use" (not your own name or "temporary user") and a code supplied in the readme.txt file. Run updater.exe after installation to register.

For WS3 registration problems contact Oxford University Press.