WordSmith Tools Manual

Remove duplicates

The problem

Text files sometimes contain duplicate sections, either because the corpus has become corrupted through being copied repeatedly onto different file-stores or because the files were not edited effectively (e.g. a single file holds several editions of the same newspaper). As a result, you may get a number of repeated concordance lines.

Solution

If you choose Edit | Remove Duplicates, Concord goes through your concordance lines and, if it finds any two that match each other (regardless of filename, date etc.), marks one of them for deletion. To establish a match it examines 500 characters centred on the search-word in each line. Every single character, including punctuation, must be identical for a match.
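The matching rule can be pictured with a minimal Python sketch. This is only an illustration of the idea, not Concord's actual code: the names ConcLine, context_key and mark_duplicates are hypothetical, and the handling of hits near the start or end of a text (the window is simply truncated) is an assumption.

```python
from dataclasses import dataclass

WINDOW = 500  # characters compared, centred on the search-word


@dataclass
class ConcLine:
    """Hypothetical stand-in for one concordance line."""
    text: str            # source text the hit was found in
    hit_start: int       # offset of the search-word within text
    hit_len: int         # length of the search-word
    filename: str = ""   # ignored when matching
    marked: bool = False  # True = marked for deletion


def context_key(line: ConcLine, window: int = WINDOW) -> str:
    """Return up to `window` characters centred on the search-word."""
    centre = line.hit_start + line.hit_len // 2
    start = max(0, centre - window // 2)  # assumption: truncate at text edges
    end = centre + window // 2
    return line.text[start:end]


def mark_duplicates(lines: list[ConcLine]) -> int:
    """Mark every later line whose context exactly equals an earlier one."""
    seen: set[str] = set()
    count = 0
    for line in lines:
        key = context_key(line)
        if key in seen:      # exact match, punctuation included
            line.marked = True
            count += 1
        else:
            seen.add(key)
    return count


lines = [
    ConcLine("The cat sat on the mat.", 4, 3, filename="edition1.txt"),
    ConcLine("The cat sat on the mat.", 4, 3, filename="edition2.txt"),
    ConcLine("The cat sat on the mat!", 4, 3, filename="edition3.txt"),
]
marked_count = mark_duplicates(lines)
```

In this sketch the second line is marked because its context is identical to the first even though the filename differs, while the third survives because a single punctuation character is different.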

Check before you zap...

When it has finished, Concord sorts all the lines so that you can see which ones match each other before you finally decide to zap the ones you really don't want.
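The review-then-zap workflow can be sketched in Python as follows. Again this is an illustration under stated assumptions, not the program's own logic: each line is represented as a hypothetical (context, source, marked) tuple, and the function names review_groups and zap are invented for the example.

```python
from itertools import groupby


def review_groups(lines):
    """Sort lines by context so that matching lines sit next to each other.

    `lines` is a list of (context, source, marked) tuples (hypothetical layout).
    Returns one list per distinct context, in sorted order.
    """
    ordered = sorted(lines, key=lambda line: line[0])
    return [list(group) for _, group in groupby(ordered, key=lambda line: line[0])]


def zap(lines):
    """Drop the lines that are still marked for deletion."""
    return [line for line in lines if not line[2]]


lines = [
    ("b context", "file1.txt", False),
    ("a context", "file2.txt", False),
    ("b context", "file3.txt", True),   # duplicate of the first line
]

groups = review_groups(lines)   # inspect these before deleting anything
kept = zap(lines)
```

Nothing is deleted by the sorting step; the marked lines disappear only when zap is called, which leaves room to unmark a line first if it should be kept.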

See also: the Corpus Checker utility's find duplicate contents function, which finds near duplicates.