Only part of file: selecting within texts

The point of it


The aim is to let you have WordSmith process only specific parts of your text files and ignore the chunks you're not interested in.


Cut out or Keep?

Press the Cut or Keep tab to choose sections to cut out, sections to keep, or both.





Sections to Cut

Note: if you only want to remove a document header (everything up to a closing tag such as </header>), it is easier to do that in the general tag settings, section Document Header.


For more complex choices, you may specify here what is to be cut: where the cut starts (for example <introduction>) and where it ends (e.g. </introduction>). You can cut out up to 7 different, separate sections (e.g. <HEAD> to </HEAD> or <BODY> to </BODY>). This function is case-sensitive and cuts out every occurrence of each section, however many times it is found within the whole text.
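To make the cutting behaviour concrete, here is a minimal Python sketch of the idea. It is not WordSmith's own code: the function name cut_sections and the constant MAX_SECTIONS are hypothetical, and the handling of a start tag with no matching end tag (left untouched here) is an assumption, since this page does not say what happens in that case.

```python
MAX_SECTIONS = 7  # the limit stated in the help text

def cut_sections(text, sections):
    """Remove every start..end span, tags included, for each (start, end) pair.

    Matching is case-sensitive and every occurrence is cut.
    """
    if len(sections) > MAX_SECTIONS:
        raise ValueError("at most 7 sections can be cut")
    for start, end in sections:
        kept = []
        pos = 0
        while True:
            i = text.find(start, pos)
            if i == -1:
                break
            j = text.find(end, i + len(start))
            if j == -1:
                break  # assumption: an unclosed section is left as it is
            kept.append(text[pos:i])
            pos = j + len(end)
        kept.append(text[pos:])
        text = "".join(kept)
    return text

sample = "<HEAD>title</HEAD>Body one.<HEAD>again</HEAD> Body two.<head>kept</head>"
print(cut_sections(sample, [("<HEAD>", "</HEAD>")]))
# prints: Body one. Body two.<head>kept</head>
```

Both <HEAD> sections are removed, while the lower-case <head> section survives because the match is case-sensitive.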



Cut start of each line/paragraph

Some corpora (e.g. LOB) have a fixed number of line-detail coding characters at the start of each line, that is, after every <Enter>, and this setting lets you cut them out. Choose the number of characters to cut, up to 100; the default is 0, which cuts nothing. Use -1 if you want to cut everything up to the first alphabetical character at the start of each line, and -2 to cut everything up to the first tab.
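The three modes of this setting can be sketched in Python as follows. This is an illustration only, with hypothetical function names and made-up sample lines. Two details are assumptions because the page does not spell them out: with -2 the tab itself is removed along with what precedes it, and a line with no letter (for -1) or no tab (for -2) is left unchanged.

```python
def cut_line_start(line, n):
    """Cut the start of one line according to the setting n."""
    if n == 0:
        return line  # default: cut nothing
    if n == -1:
        # cut everything before the first alphabetical character
        for i, ch in enumerate(line):
            if ch.isalpha():
                return line[i:]
        return line  # assumption: no letter found, line unchanged
    if n == -2:
        i = line.find("\t")
        # assumption: the tab itself is cut too; no tab, line unchanged
        return line if i == -1 else line[i + 1:]
    if 1 <= n <= 100:
        return line[n:]
    raise ValueError("n must be -2, -1, or 0 to 100")

def cut_text(text, n):
    """Apply the cut after every line break in the text."""
    return "\n".join(cut_line_start(line, n) for line in text.split("\n"))

print(cut_line_start("A01 0010 The jury said", 9))    # prints: The jury said
print(cut_line_start("0010  The jury said", -1))      # prints: The jury said
print(cut_line_start("A01 0010\tThe jury said", -2))  # prints: The jury said
```

Note that -1 would not help with the first sample line, because its coding begins with a letter; a fixed character count or the tab option suits that kind of coding better.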



Sections to Keep (contexts)

You want to select just one or two sections of the text and cut out the rest. Specify one tag to define the desired start, and one to specify the end, e.g. <Intro> to <Body> (these would analyse only text introductions), or <Mary> to </Mary> (these would get all of Mary's contributions in the discourse but nothing else).



In this example we have chosen two different sections: <Peter> to </Peter> to get the sections spoken by Peter, and <Hong Kong> to </Hong Kong> to get the sections marked up as referring to Hong Kong as well.

Naturally, you must be sure that each section is defined by something unique, such as a tag with < or > symbols. This function is case-sensitive (so <Peter> would not find <PETER>).


If you used <H1> to </H1> with this function in HTML text you'd get all the major headings in your texts, however many, but nothing else.

The "off" switch doesn't have to look like the "on" switch -- you could keep, for example, <INTRO> to </BODY> and thereby cut out the conclusion if that comes after the </BODY>.
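The keep behaviour described in this section can be sketched in Python like this. It is a rough illustration under stated assumptions, not WordSmith's implementation: keep_sections is a hypothetical name, the tags themselves are assumed to be excluded from what is kept, and an "on" tag with no later "off" tag is assumed to select nothing.

```python
def keep_sections(text, sections):
    """Return the stretches of text between each (start, end) pair, in document order.

    Matching is case-sensitive and every occurrence is collected.
    """
    spans = []
    for start, end in sections:
        pos = 0
        while True:
            i = text.find(start, pos)
            if i == -1:
                break
            j = text.find(end, i + len(start))
            if j == -1:
                break  # assumption: no "off" switch found, nothing more kept
            spans.append((i + len(start), j))
            pos = j + len(end)
    spans.sort()
    return [text[a:b] for a, b in spans]

talk = "<Peter>Hello.</Peter><Mary>Hi.</Mary><Peter>Bye.</Peter>"
print(keep_sections(talk, [("<Peter>", "</Peter>")]))  # prints: ['Hello.', 'Bye.']
print(keep_sections(talk, [("<PETER>", "</PETER>")]))  # prints: []

# The "off" switch need not mirror the "on" switch:
doc = "<INTRO>a</INTRO><BODY>b</BODY>c"
print(keep_sections(doc, [("<INTRO>", "</BODY>")]))    # prints: ['a</INTRO><BODY>b']
```

In the last call the conclusion "c" is dropped because it comes after </BODY>, while any tags lying between the on and off switches stay in the kept text in this sketch.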


Ignore text files not containing choices

If this is checked, your text files will be examined to ensure they contain the mark-up for sections to keep (here <Peter> and <Hong Kong>); files that do not contain it are ignored.
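As a rough Python illustration of this check, the sketch below filters a set of texts by whether they contain the start tags of the sections to keep. The names contains_choices and select_files and the sample file names are hypothetical. Whether a file must contain every start tag or only one of them is not stated on this page; the sketch assumes that one is enough.

```python
def contains_choices(text, start_tags):
    """True if the text contains at least one start tag (case-sensitive).

    Assumption: one matching tag is enough for the file to qualify.
    """
    return any(tag in text for tag in start_tags)

def select_files(files, start_tags):
    """Keep only the files whose text contains the required mark-up."""
    return {name: text for name, text in files.items()
            if contains_choices(text, start_tags)}

files = {
    "a.txt": "<Peter>Hello.</Peter>",
    "b.txt": "<Mary>Hi.</Mary>",
    "c.txt": "We visited <Hong Kong>Kowloon</Hong Kong>.",
}
print(sorted(select_files(files, ["<Peter>", "<Hong Kong>"])))
# prints: ['a.txt', 'c.txt']
```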



Once you've pressed OK, you will see that WordSmith knows you want only certain parts of each file because the Only Part of File button goes red (as will the Only if Containing button if there were sections to keep and the Ignore text files not containing choices box was checked).




See also: Tags as Selectors, Only if containing <x>, Guide to handling the BNC.


