WordSmith Tools Manual

Text Converter: converting BNC XML version

 

The British National Corpus is a valuable resource, but it has certain problems as it comes straight off the CD-ROM:

 

it is in Unix format

it has entities like &eacute; to represent characters like é

its structure is opaque and file-names mean nothing
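
To make the first two problems concrete, here is a minimal Python sketch of what fixing them amounts to. It is not WordSmith's own code. It assumes that "Unix format" refers to LF line endings and that the entity names involved are ones listed in Python's `html.entities.html5` table. The function names `unix_to_windows` and `decode_entities` are illustrative only. Unknown entity names and the five predefined XML entities are left untouched so that the markup stays well-formed.

```python
import re
from html.entities import html5  # maps names such as "eacute;" to "é"

# Entities that XML itself needs; decoding them could break the markup.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def unix_to_windows(text: str) -> str:
    """Turn LF line endings into CRLF without doubling existing CRLF."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


def decode_entities(text: str) -> str:
    """Replace known named entities with the characters they stand for."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # keep &amp; &lt; etc. as they are
        # Unknown names fall through unchanged.
        return html5.get(name + ";", match.group(0))
    return ENTITY.sub(replace, text)


if __name__ == "__main__":
    sample = "<w>caf&eacute;</w> &amp; more\n"
    print(repr(unix_to_windows(decode_entities(sample))))
```

Entity names used in the corpus that are not in the HTML5 table would need their own lookup table; the sketch simply leaves them as they are.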

 

You will find it much easier to use if you

 

convert it to Unicode

filter the files to make a useful structure

 

as explained at http://lexically.net/wordsmith/Handling_BNC/index.html

 

The easiest way to do that is in three stages.

 

Conversion:

 

BNC_XML_conversion_choosing_texts

After choosing the texts,

 

BNC_XML_conversion_1

After that, select the files you have just converted to Windows format (here at J:\temp\BNC_XML_1) and do another conversion:

 

BNC_XML_conversion_2

You will find the BNC XML categories file in your Documents\wsmith7 folder. When you press OK you will be asked something like this:

BNC_XML_conversion_confirm

After the work is done you will see the BNC texts copied to a similar folder structure (in our case under J:\temp):

 

BNC_texts_copied_1

BNC_texts_copied_2

BNC_texts_copied_3
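
The idea of copying texts into "a similar structure" can be sketched in Python as follows. This is only an illustration of mirroring a folder tree while converting each file, not a description of how Text Converter works internally. The function name `convert_tree` and the choice to touch only `*.xml` files are assumptions, and the only conversion applied here is LF to CRLF line endings.

```python
from pathlib import Path


def convert_tree(src_root: Path, dst_root: Path) -> int:
    """Copy every .xml file under src_root to the same relative path
    under dst_root, converting LF line endings to CRLF on the way.
    Returns the number of files written."""
    count = 0
    for src in sorted(src_root.rglob("*.xml")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        data = src.read_bytes()
        # Normalise first so files that already use CRLF are not doubled.
        data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
        dst.write_bytes(data)
        count += 1
    return count
```

Working on bytes rather than decoded text keeps the sketch independent of the files' character encoding; the source files themselves are never modified.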

 

Filter:

 

Choose the converted texts in the first window:

 

BNC_XML_filter_choosing_texts

de-activate conversion,

BNC_XML_filter_deactivate conversion

and choose filtering like this:

 

BNC_XML_filter_settings

Eventually you should get folder structures like this:

 

BNC_XML_filtered
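
Conceptually, filtering sorts each text into a folder named after its category. The Python sketch below illustrates that idea only. The format of the BNC XML categories file is not described here, so the `categories` dictionary, the category names, the text identifiers and the output folder `J:\temp\BNC_filtered` are all hypothetical stand-ins, as is the function name `plan_destinations`.

```python
from pathlib import PureWindowsPath


def plan_destinations(file_names, categories, out_root):
    """Map each file name to a destination path of the form
    out_root\\<category>\\<file name>. Texts with no known category
    go into an 'uncategorised' folder."""
    root = PureWindowsPath(out_root)
    plan = {}
    for name in file_names:
        path = PureWindowsPath(name)
        category = categories.get(path.stem, "uncategorised")
        plan[name] = str(root / category / path.name)
    return plan


if __name__ == "__main__":
    # Hypothetical category assignments, for illustration only.
    categories = {"A00": "written", "KB7": "spoken"}
    files = ["A00.xml", "KB7.xml", "ZZZ.xml"]
    for src, dst in plan_destinations(files, categories,
                                      r"J:\temp\BNC_filtered").items():
        print(src, "->", dst)
```

The function only computes where each file would go and moves nothing, so the plan can be checked before any files are copied.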

 

See also: XML simplification