A pruning strategy to improve pairwise comparison-based near-duplicate detection

Summary :

Efficient and accurate near-duplicate detection is a trending topic of research. Complications arise from the great time and space complexities of existing algorithms. This study proposes a novel pruning strategy to improve pairwise comparison-based near-duplicate detection methods. After parsing the documents into punctuation-delimited blocks called chunks, it decides between the categories of “near duplicate,” “non-duplicate” or “suspicious” by applying certain filtering rules. This early decision makes it possible to disregard many of the non-necessary computations—on average 92.95% of them. Then, for the suspicious pairs, common chunks and short chunks are removed and the remaining subsets are reserved for near-duplicate detection. Size of the remaining subsets is on average 4.42% of the original corpus size. Evaluation results show that near-duplicate detection with the proposed strategy in its best configuration (CHT 8, τ 0.1) has F-measure 87.22% (precision 86.91% and recall 87.54%). Its F-measure is comparable with the SpotSig method with less execution time. In addition, applying the proposed strategy in a near-duplicate detection process eliminates the need for preprocessing. It is also tunable to achieve the intended levels of near duplication and noise suppression

Authors:
Roya Hassanian
Mohammad Javad Kargar

Language Article : English

Publication date: 03 January 2019

Round: 61

Number: 2

page - from:: 1

page - to:: 33

index: ISI