data > corpora > Bible:
TRACER_DATA folder would be created in your specified directory (e.g. data > corpora > MyTexts).
01-02-WLP-lem_true_syn_true_ssim_false_redwo_false-ngram_5-LLR_true_toLC_false_rDia_false_w2wl_false-wlt_5
Let's break it down:
inv stands for inverted list. This file works like a word index and is the heart of any retrieval system. It shows you the position of a given word in a given textual segment/unit. For example, the row 114 4003870 2 in an .inv TRACER file means that word 114 can be found in segment
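As a rough illustration of how such a row could be read, the sketch below parses the example row into a small lookup table. It assumes the columns are whitespace-separated and that the third column is the position of the word within the segment; the function names parse_inv_line and build_index are hypothetical and not part of TRACER.

```python
from collections import defaultdict


def parse_inv_line(line):
    """Split one row into (word_id, segment_id, position) integers.

    Assumption: columns are whitespace-separated and the third column
    is the word's position inside the segment.
    """
    word_id, segment_id, position = line.split()
    return int(word_id), int(segment_id), int(position)


def build_index(lines):
    """Map each word ID to the (segment_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            word_id, segment_id, position = parse_inv_line(line)
            index[word_id].append((segment_id, position))
    return index


rows = ["114 4003870 2"]
index = build_index(rows)
print(index[114])  # [(4003870, 2)]
```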
wnc stands for words number complete. This file provides a list of all the types in the corpus, including rank, word length and frequency. The list and the IDs are frequency-sorted.
RANK WORD LENGTH FREQUENCY
WORD_TYPES: number of unique words as dictionary entries.
WORD_TOKENS: occurrences of a word; every word in a text, no matter how many times it occurs.
SOURCES: for example, a book.
SSIM_THRESHOLD: degree of similarity required to consider two words as similarly written.
SSIM_EDGES: number of links or word pairs satisfying the similarity requirements stated in
BOW_WORD_TOKENS: tokens or unique words that appear in a line.
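The difference between word types and word tokens can be made concrete with a short sketch. It splits on whitespace only, which is a simplification and not TRACER's own tokenisation; the function name count_types_and_tokens is hypothetical.

```python
from collections import Counter


def count_types_and_tokens(text):
    """Return (number of types, number of tokens, frequency table).

    Assumption: tokens are obtained by plain whitespace splitting.
    """
    tokens = text.split()
    frequencies = Counter(tokens)
    return len(frequencies), len(tokens), frequencies


sample = "in the beginning was the word and the word was with god"
types, tokens, frequencies = count_types_and_tokens(sample)
print(types, tokens)               # 8 12
print(frequencies.most_common(1))  # [('the', 3)]
```

Here "the" is a single type but contributes three tokens, which is why the token count is always at least as large as the type count.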
.char information the user can more easily identify and remove it from the text(s).
WORD-ID WORD FREQUENCY
FEATURE-ID REUSE-ID POS
FEATURE-ID = see below.
REUSE-ID = ID from the input file, first column.
POS = position of the feature in the reuse under REUSE-ID. In the case of n-grams as features, the position is always that of the first word in the n-gram.
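The rule that an n-gram takes the position of its first word can be sketched as follows. The function name ngram_features is hypothetical, and positions are counted from 0 here purely as an assumption; the numbering TRACER itself uses may differ.

```python
def ngram_features(words, n):
    """Return (ngram, position) pairs for a list of words.

    The position of each n-gram is the index of its first word.
    Assumption: positions are 0-based in this sketch.
    """
    return [(tuple(words[i:i + n]), i) for i in range(len(words) - n + 1)]


words = ["in", "the", "beginning", "was", "the", "word"]
for ngram, position in ngram_features(words, 2):
    print(position, " ".join(ngram))
```

With n set to 1 every feature is a single word and its position is simply that word's index, which matches the case of using words as features.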
.train file to the word(s) that make up that feature. Specifically, if you are using words as features, the numbers will be identical. If, instead, you are using n-grams, the second column contains the ID of each word in the n-gram.
The .sel file is the same as the .train file, only without the removed words.
The .score file contains all computed reuse pairs. The first two columns list the IDs of the aligned reuse units, the third column displays the number of shared features (absolute overlap), and the fourth column the degree of similarity (weighted overlap). A similarity of 0.1 is 10%, 0.2 is 20%, and so on up to 1 = 100%. In the example below, the two sentences have two features in common for a total similarity of 50%:
1101991 1300887 2.0 0.5
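A row like this could be read programmatically along the following lines. This is a minimal sketch assuming whitespace-separated columns in the order described above; ScoreRow and parse_score_line are hypothetical names, not part of TRACER.

```python
from typing import NamedTuple


class ScoreRow(NamedTuple):
    unit_a: int             # ID of the first reuse unit
    unit_b: int             # ID of the second reuse unit
    shared_features: float  # absolute overlap
    similarity: float       # weighted overlap, between 0 and 1


def parse_score_line(line):
    """Parse one row of four whitespace-separated columns (assumed layout)."""
    a, b, overlap, similarity = line.split()
    return ScoreRow(int(a), int(b), float(overlap), float(similarity))


row = parse_score_line("1101991 1300887 2.0 0.5")
print(f"{row.unit_a} ~ {row.unit_b}: "
      f"{row.shared_features:g} shared features, {row.similarity:.0%} similar")
# 1101991 ~ 1300887: 2 shared features, 50% similar
```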
The .score file lists results bidirectionally and thus redundantly. The two results below, for example, represent the same reuse alignment but the order of the IDs is inverted:
1102581 1300887 2.0 0.5
1300887 1102581 2.0 0.5
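If the redundancy gets in the way of later analysis, one possible approach is to keep each alignment only once by treating the two IDs as an unordered pair. The sketch below assumes whitespace-separated columns; deduplicate_pairs is a hypothetical helper, not a TRACER function.

```python
def deduplicate_pairs(lines):
    """Keep one row per unordered pair of reuse-unit IDs.

    Each pair is normalised so the smaller ID comes first; the first
    occurrence of a pair wins and later duplicates are dropped.
    """
    seen = set()
    unique = []
    for line in lines:
        a, b, overlap, similarity = line.split()
        key = (min(int(a), int(b)), max(int(a), int(b)))
        if key not in seen:
            seen.add(key)
            unique.append((key[0], key[1], float(overlap), float(similarity)))
    return unique


rows = [
    "1102581 1300887 2.0 0.5",
    "1300887 1102581 2.0 0.5",
]
print(deduplicate_pairs(rows))  # [(1102581, 1300887, 2.0, 0.5)]
```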