Only part of file: selecting within texts


 

 

[Screenshot: select only sections of text files]

 

Cut start of each line/paragraph

Some corpora (e.g. LOB) begin each line with a fixed number of line-detail codings, and this setting lets you cut them out (that is, after every <Enter>). Choose the number of characters to cut, up to 100; the default is 0. Use -1 to cut everything up to the first alphabetical character at the start of each line, and -2 to cut everything up to the first tab.
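The effect of this setting can be pictured with a minimal Python sketch. It is not WordSmith's own code: the function names cut_line_start and cut_text are hypothetical, and it assumes that -2 also removes the tab itself and that a line with no letter (for -1) or no tab (for -2) is handled as noted in the comments.

```python
def cut_line_start(line: str, n: int = 0) -> str:
    """Hypothetical helper imitating the 'cut start of each line' setting."""
    if n < -2 or n > 100:
        raise ValueError("n must be between -2 and 100")
    if n == -1:
        # Cut everything before the first alphabetical character.
        for i, ch in enumerate(line):
            if ch.isalpha():
                return line[i:]
        return ""  # assumption: a line containing no letters becomes empty
    if n == -2:
        # Cut everything up to the first tab.
        tab = line.find("\t")
        if tab == -1:
            return line  # assumption: a line without a tab is left alone
        return line[tab + 1:]  # assumption: the tab itself is cut as well
    # n >= 0: cut a fixed number of characters (0 leaves the line unchanged).
    return line[n:]


def cut_text(text: str, n: int = 0) -> str:
    """Apply the cut after every line break in the text."""
    return "\n".join(cut_line_start(line, n) for line in text.split("\n"))


if __name__ == "__main__":
    print(cut_line_start("A01   1 The cat sat.", 8))
    print(cut_line_start("123 456 Hello there", -1))
    print(cut_line_start("A01 2\tSecond line", -2))
```

A fixed count suits corpora whose codings always have the same width; -1 and -2 are for codings of varying length.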

 

Sections to Cut

If you are using text files with SGML, XML or HTML headers (e.g. the British National Corpus) you may simply want to cut out the header from your word lists, concordances, etc. as shown in the Document header example.

 

For more complex choices, you may specify here what is to be cut: where the section starts (for example <Introduction>) and where it ends (e.g. </Introduction>). You can cut out up to 3 separate sections (e.g. <HEAD> to </HEAD> or <BODY> to </BODY>). This function is case-sensitive and cuts out every occurrence of a matching section found in the whole text.
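As a rough illustration of the cutting behaviour described above, here is a minimal Python sketch. The function name cut_sections is hypothetical and this is not how WordSmith itself is implemented; it assumes that the start and end tags are removed along with the text between them, and that a start tag with no matching end tag is left untouched.

```python
def cut_sections(text: str, pairs: list[tuple[str, str]]) -> str:
    """Hypothetical sketch: remove every section delimited by each (start, end) pair."""
    if len(pairs) > 3:
        raise ValueError("up to 3 separate sections can be cut")
    for start, end in pairs:
        if not start or not end:
            raise ValueError("start and end markers must not be empty")
        while True:
            i = text.find(start)  # str.find is case-sensitive
            if i == -1:
                break
            j = text.find(end, i + len(start))
            if j == -1:
                break  # assumption: an unterminated section is not cut
            # Drop the start tag, the section body and the end tag.
            text = text[:i] + text[j + len(end):]
    return text


if __name__ == "__main__":
    sample = "<HEAD>Title one</HEAD>First text. <HEAD>Title two</HEAD>Second text."
    print(cut_sections(sample, [("<HEAD>", "</HEAD>")]))
```

Because the match is case-sensitive, a pair such as <HEAD> to </HEAD> leaves a lower-case <head> section in place.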

 

Sections to Keep (contexts)

Use this when you want to select one section of a text and cut out the rest. Specify one tag to define the desired start and one to define the end, e.g. <Intro> to <Body> (these would analyse only text introductions), or <Mary> to </Mary> (these would get all of Mary's contributions in the discourse but nothing else).

 

Naturally you must be sure that there is something unique, such as a < or > symbol, to define each section. This function is case-sensitive (so <Mary> would not find <MARY>).

If you used <H1> to </H1> with this function in HTML text you'd get all the major headings in your texts, however many, but nothing else.

 

You can specify 2 different sections, e.g. <Intro> to </Intro> to get the introduction and <Conclusion> to </Conclusion> to get the conclusion as well. The "off" switch doesn't have to look like the "on" switch; you could keep, for example, <INTRO> to </BODY> and thereby cut out the conclusion if that comes after the </BODY>.
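The keep behaviour described above can be sketched in Python as follows. This is only an illustration, not WordSmith's implementation: the function name keep_sections is hypothetical, and it assumes that the tags themselves are excluded from the kept text, that kept sections are returned in the order they occur, and that a start tag with no later end tag keeps nothing.

```python
def keep_sections(text: str, pairs: list[tuple[str, str]]) -> list[str]:
    """Hypothetical sketch: keep only the text between each (start, end) pair."""
    if len(pairs) > 2:
        raise ValueError("up to 2 different sections can be kept")
    kept = []
    for start, end in pairs:
        if not start or not end:
            raise ValueError("start and end markers must not be empty")
        pos = 0
        while True:
            i = text.find(start, pos)  # case-sensitive search
            if i == -1:
                break
            j = text.find(end, i + len(start))
            if j == -1:
                break  # assumption: no end tag means nothing more is kept
            kept.append((i, text[i + len(start):j]))
            pos = j + len(end)
    kept.sort()  # return the sections in document order
    return [chunk for _, chunk in kept]


if __name__ == "__main__":
    talk = "<Mary>Hello.</Mary><John>Hi.</John><Mary>Goodbye.</Mary>"
    print(keep_sections(talk, [("<Mary>", "</Mary>")]))
```

The end marker is searched for independently of the start marker, so a pair such as <INTRO> to </BODY> works in the same way as a matching pair of tags.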

 

[Screenshot: select within 2 choices]

In this example, all the Peter sections and all the Hong Kong sections will be used for the word list, concordance, etc., but nothing else.

 

See also: Tags as Selectors, Only if containing <x>, Guide to handling the BNC.

 

 

Page url: http://www.lexically.net/downloads/version5/HTML/?selectingwithintexts.htm