
WordSmith Tools Manual


Mis-matched tags problem

 

Well-formed text is like this:

<introduction>blah blah blah</introduction>

<body>blah blah blah</body>

<conclusion>blah blah blah</conclusion>

 

Ill-formed text might be like this:

<introduction>blah blah blah</introduction>

<body>blah blah blah

<conclusion>blah blah blah</conclusion>

 

The body section of the text has no explicit end. Suppose that, for a corpus of text files, you wanted to process Only Part of Text, keeping the introduction and the body of each file but not the conclusion. If you specified the range from <introduction> to </body>, the ill-formed text would be handled wrongly: it might be missed completely, or the selection might run on to include the conclusion.
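The two failure modes can be illustrated with a minimal Python sketch. This is not how WordSmith itself selects text; the function extract_part and the sample strings are hypothetical, and the only assumption is that a range runs from a start tag to the nearest following end tag.

```python
import re

def extract_part(text, start_tag, end_tag):
    """Return every span from start_tag to the nearest following end_tag.

    Hypothetical helper for illustration only, not WordSmith's own logic.
    """
    pattern = re.escape(start_tag) + r".*?" + re.escape(end_tag)
    return re.findall(pattern, text, flags=re.DOTALL)

well_formed = (
    "<introduction>intro one</introduction>\n"
    "<body>body one</body>\n"
    "<conclusion>end one</conclusion>\n"
)
# Same layout, but the body section is never closed.
ill_formed = (
    "<introduction>intro two</introduction>\n"
    "<body>body two\n"
    "<conclusion>end two</conclusion>\n"
)

# Well-formed text: introduction and body are selected, conclusion is not.
good = extract_part(well_formed, "<introduction>", "</body>")

# Ill-formed text on its own: no </body> exists, so nothing is selected.
missed = extract_part(ill_formed, "<introduction>", "</body>")

# Ill-formed text followed by more text: the selection runs on through
# the conclusion until it reaches the next </body>.
run_on = extract_part(ill_formed + well_formed, "<introduction>", "</body>")
```

In this sketch, missed is empty (the ill-formed case is skipped entirely), while run_on holds a single span that swallows the first conclusion.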

 

Solution

 


This procedure will examine all suitable text files and check whether every start tag has a corresponding end tag. In the examples below, the section checked runs from <text> to </text>.

If a section is found where there is

<text> blah blah blah

<text> blah blah blah

</text>

the procedure will notice that the first <text> section has not finished before the second one starts. It will create a text file (mismatches.txt) showing all the contexts where there were mismatches.
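The check described above can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not WordSmith's implementation: the function name find_mismatches, the context parameter and the shape of the report are all hypothetical.

```python
import re

def find_mismatches(text, name, context=30):
    """List start tags that are not closed before the next start tag
    (or before the end of the text).

    Returns (offset, context snippet) pairs. Hypothetical helper; the
    real procedure's report format is not documented here.
    """
    start, end = f"<{name}>", f"</{name}>"
    token = re.compile(re.escape(start) + "|" + re.escape(end))
    problems = []
    open_at = None  # offset of the currently open start tag, if any
    for m in token.finditer(text):
        if m.group() == start:
            if open_at is not None:
                # A new section starts before the previous one finished.
                problems.append((open_at, text[open_at:open_at + context]))
            open_at = m.start()
        else:
            open_at = None  # the end tag closes the open section
    if open_at is not None:
        # The last section was never closed.
        problems.append((open_at, text[open_at:open_at + context]))
    return problems

sample = "<text> blah blah blah\n<text> blah blah blah\n</text>\n"
report = find_mismatches(sample, "text")
clean = find_mismatches("<text> blah </text>\n<text> blah </text>\n", "text")
```

Here report contains one entry, pointing at the first <text>, and clean is empty. Writing the snippets to a file would give something comparable to the saved list of contexts.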

 

Fix automatically?

If Fix automatically is checked, the procedure will insert </text> just before the second <text>:

<text> blah blah blah

</text><text> blah blah blah

</text>

Any original text files with mismatched sections will then be fixed and saved with a .fixed file extension.
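The insertion rule can be sketched in a few lines of Python. As before, this is a hedged illustration rather than the program's actual code: fix_mismatches is a hypothetical name, and the sketch works on a string only, leaving aside how the .fixed files are written.

```python
import re

def fix_mismatches(text, name):
    """Insert the end tag just before any start tag that arrives while
    the previous section is still open.

    Hypothetical helper. A section left open at the very end of the
    text is not touched, since the source does not say how that case
    is handled.
    """
    start, end = f"<{name}>", f"</{name}>"
    token = re.compile(re.escape(start) + "|" + re.escape(end))
    pieces = []
    last = 0         # offset up to which text has been copied
    is_open = False  # True while a section has started but not ended
    for m in token.finditer(text):
        if m.group() == start:
            if is_open:
                # Copy text up to the second start tag, then close
                # the first section immediately before it.
                pieces.append(text[last:m.start()])
                pieces.append(end)
                last = m.start()
            is_open = True
        else:
            is_open = False
    pieces.append(text[last:])
    return "".join(pieces)

sample = "<text> blah blah blah\n<text> blah blah blah\n</text>\n"
fixed = fix_mismatches(sample, "text")
```

For the sample above, fixed places </text> directly in front of the second <text>, as in the example. Text that is already well-formed passes through unchanged.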

 

Note that this may not be exactly what you want. Imagine that a <body> section always follows an <introduction>:

<introduction> blah blah blah </introduction>

<body> blah blah blah

<introduction> blah blah blah </introduction>

<body> blah blah blah </body>

 

If you automatically fix the <body> to </body> sections of this text, you will get

<introduction> blah blah blah </introduction>

<body> blah blah blah

<introduction> blah blah blah </introduction>

</body><body> blah blah blah </body>

 

because the procedure inserts the end tag just before the second start tag, so the first <body> section now wrongly includes the second <introduction>. You may therefore prefer not to fix automatically. Instead, you can edit each text file manually, using the saved mismatches.txt to find the contexts where there were mismatches.