The point of it
The aim here is to find repeated chunks, such as can be caused:
• if someone has inserted a paragraph twice by mistake
• by plagiarism
• by re-writing and editing text
• by copying and pasting
• by quoting
• by standard chunks of text used as jargon or for convenience.
The procedure looks essentially for repeated sentences and headings in a whole lot of texts.
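The procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the idea (split texts into sentence-like chunks, then count where each one recurs), not the program's actual implementation; the function name and the simple split on ., ! and ? are assumptions for the sketch.

```python
import re
from collections import defaultdict
from pathlib import Path

def repeated_chunks(folder, pattern="*.txt", min_len=25):
    """Collect sentence-like chunks from every matching file and
    record where each chunk occurs; keep only chunks seen twice or
    more. Splitting on ., ! and ? is a rough stand-in for real
    sentence detection."""
    hits = defaultdict(list)  # chunk -> list of (file name, offset)
    for path in Path(folder).rglob(pattern):
        # skip anything inside sub-folders called filtered or moved
        if {"filtered", "moved"} & {p.name for p in path.parents}:
            continue
        text = path.read_text(errors="ignore")
        for m in re.finditer(r"[^.!?]+[.!?]", text):
            chunk = " ".join(m.group().split())  # normalise whitespace
            if len(chunk) >= min_len:
                hits[chunk].append((path.name, m.start()))
    return {c: locs for c, locs in hits.items() if len(locs) > 1}
```

A chunk that appears in two different files, or twice in one file, will show up in the result with all its locations.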
How to do it
Press the Start button to choose a folder. It and all its sub-folders will be searched, except any folder called filtered or moved.
Choose the file-types to search (default *.*) and a tag span, such as 200 characters, since mark-up is ignored in this search. Set the minimum number of hits overall and the minimum number of hits there must be per text file. Min. length is the minimum length of any repeated chunk.
Include unterminated sentences: this option also includes headings.
Press the Start button. Here we should get any repeated chunks over 25 characters which come at least 15 times overall, at least once in each text file.
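The settings above amount to a filter over the candidate chunks. A minimal sketch, assuming the occurrence data has already been gathered as a mapping from chunk to its (file, position) hits (the function name and data shape are illustrative, not the program's own):

```python
def filter_chunks(hits, min_hits=15, min_per_file=1, min_len=25):
    """Keep only chunks that are long enough, occur often enough
    overall, and occur at least min_per_file times in every file
    that contains them. hits maps chunk -> [(file_name, offset)]."""
    kept = {}
    for chunk, locs in hits.items():
        if len(chunk) < min_len or len(locs) < min_hits:
            continue
        per_file = {}
        for fname, _ in locs:
            per_file[fname] = per_file.get(fname, 0) + 1
        if min(per_file.values()) >= min_per_file:
            kept[chunk] = locs
    return kept
```

With the example values above (25 characters, 15 hits, 1 per file), only widely repeated chunks survive.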
The program has examined over 7,200 text files (out of about 9,000) and has found many chunks. Most do not recur within any one file, since we set the minimum per file to 1.
At the end of the search (it took about a minute), the program found about 330,000 repeated chunks, reduced by these settings to just under 23 entries.
Most are like this:
Messages to the reader about legal responsibilities, or offering choices to readers, not directly concerned with the main text content.
Some, like cluster 21, are more concerned with text content:
A concordance shows this:
Very clearly this boilerplate chunk is a quote from an expert during a heatwave.
The screen-shots above show a focus on Chunks.
If you change the focus to Files, you get the list ordered according to which files contain the most different boilerplate chunks. (Click the Freq. header until the highest values come to the top.)
The list looks more specific and topic-focussed because the text file-names in this study are very informative. The top few seem to relate a lot to climate change and weather.
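The Files focus is essentially a ranking of files by how many distinct boilerplate chunks each contains. A minimal sketch of that ordering, again assuming the chunk-to-locations mapping from the search (the function name is illustrative):

```python
def rank_files(hits):
    """Order files by how many different boilerplate chunks they
    contain, highest first. hits maps chunk -> [(file_name, offset)]."""
    per_file = {}
    for chunk, locs in hits.items():
        for fname in {f for f, _ in locs}:
            per_file.setdefault(fname, set()).add(chunk)
    return sorted(((f, len(chunks)) for f, chunks in per_file.items()),
                  key=lambda pair: -pair[1])
```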
Further Options
At the top are various buttons offering options:
Show this text
which for the Nebraska text showed this:
Highlighting all (right-click) got this:
showing (for this text) that all 6 cases of 5 boilerplate strings come right at the end of the text once the main text message is finished.
You can click the marks in the plot to see each of the references to the same piece of boilerplate. Or go through them with the < and > arrows.
Extreme boilerplate
Some texts contain huge amounts of repetition. This example is from "live" texts where journalists keep editing and adding to a story as it builds. In this case it was a story about temperature records, and many strings got repeated. It is a bit like the difference between a photo and a movie, where the same image, slightly varied, is repeated.
Compute Concordance
This passes the set of file-names and the chunk of boilerplate text to Concord, for concordancing.
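The concordance view itself is a keyword-in-context (KWIC) display: each occurrence of the chunk shown with a little surrounding text. A minimal sketch of that idea, not Concord's own code (the function name and fixed-width layout are assumptions):

```python
def concordance(text, chunk, width=30):
    """Return one KWIC line per occurrence of chunk in text, with
    up to `width` characters of context on either side."""
    lines = []
    start = text.find(chunk)
    while start != -1:
        left = text[max(0, start - width):start]
        right = text[start + len(chunk):start + len(chunk) + width]
        lines.append(f"{left:>{width}} [{chunk}] {right}")
        start = text.find(chunk, start + 1)
    return lines
```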
Save and Load
These save or retrieve a saved listing.
Save listing
This lets you save the listing, either to the clipboard from which you can paste it wherever you like, or to an Excel spreadsheet:
See also: duplicate file contents, corruption check, duplicate file-names.