Alternative formats for the BNC

A corpus like the BNC may be usefully converted in three or four ways:

1.	into a format which Windows will expect, preferably with a .txt filename so that Windows will open each text easily

2.	with the files all stored in folders whose names mean something useful

3.	into Unicode, a format which handles all the curly quote marks and dashes unambiguously

4.	optionally, into a markup-free copy so that you can read the texts easily.
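To make these conversions concrete, here is a minimal Python sketch of points 1, 3 and 4 for a single file. It is only an illustration and not how Text Converter works internally. It assumes that each BNC XML file is UTF-8 and keeps its running text inside a `wtext` (written) or `stext` (spoken) element; the function names `plain_text` and `convert_file` and the default `utf-16` output encoding are my own choices for the example.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def plain_text(xml_data):
    """Return the running text of one document with all markup removed.

    Assumption: the text body sits in a <wtext> or <stext> element.
    If neither is found, the whole document (header included) is used.
    """
    root = ET.fromstring(xml_data)
    body = root.find("wtext")
    if body is None:
        body = root.find("stext")
    if body is None:
        body = root
    text = "".join(body.itertext())
    # Collapse line breaks and runs of spaces left behind by the markup.
    return re.sub(r"\s+", " ", text).strip()


def convert_file(src, dest_dir, encoding="utf-16"):
    """Write a markup-free .txt copy of src into dest_dir as Unicode."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.stem + ".txt")
    # Reading bytes lets the XML parser work out the source encoding itself.
    dest.write_text(plain_text(src.read_bytes()), encoding=encoding)
    return dest
```

Calling `convert_file` on each file in the \texts folder would give a markup-free Unicode copy with a .txt filename; curly quotes and dashes already present in the XML are carried over unchanged.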

In WordSmith, use Text Converter for this.

I selected these texts (the XML edition has three main folders; the ones needed for WordSmith are in the \texts folder):

[Screenshot: choosing the BNC XML texts in Text Converter]

Then I converted them into Unicode, dealing with curly quotes etc., with the options checked as below:

[Screenshot: Text Converter converting the BNC XML texts]

The above screenshot was taken as the processing was being done; it took about an hour as there are many thousands of text files. Then I filtered them according to Dave Lee's codes so as to get them into folders that mean something to me!

[Screenshot: Text Converter filtering all the BNC XML texts by class code]

That took another hour, working across a home network.
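For anyone curious about what sorting by class code involves, here is a rough Python sketch of the idea. It is not Text Converter's own method. It assumes that each BNC XML header holds Dave Lee's genre label in a `classCode` element with `scheme="DLEE"`; the function names and the "unclassified" fallback folder are hypothetical, chosen just for this example.

```python
import re
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def dlee_code(xml_data):
    """Return the Dave Lee class code of one document, or None if absent.

    Assumption: the code is stored as <classCode scheme="DLEE">...</classCode>.
    """
    root = ET.fromstring(xml_data)
    for element in root.iter("classCode"):
        if element.get("scheme") == "DLEE":
            return (element.text or "").strip()
    return None


def folder_name(code):
    """Turn a class code into a name that is safe to use as a Windows folder."""
    return re.sub(r"[^A-Za-z0-9]+", "_", code).strip("_")


def sort_into_folders(src_dir, dest_dir):
    """Copy every .xml file under src_dir into a folder named after its code."""
    for src in Path(src_dir).rglob("*.xml"):
        code = dlee_code(src.read_bytes())
        # Files without a recognisable code go into a catch-all folder.
        name = folder_name(code) if code else "unclassified"
        target = Path(dest_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target / src.name)
```

The files are copied rather than moved, so the original \texts folder stays intact if anything goes wrong part-way through.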

Page url: http://www.lexically.net/wordsmith/Handling%20BNC/index.html?alternative_formats_for_bnc.htm