Sectional MinHash for near-duplicate detection
MinHash is a widely used method for efficiently estimating the similarity between documents in Near-Duplicate Detection (NDD). However, it is based on the concept of set resemblance rather than near-duplication. This study proposes Sectional MinHash (S-MinHash), a method designed specifically for detecting near-duplicate documents. The proposed method enhances the MinHash data structure with information about the location of the attributes in the document. It provides an unbiased estimate of the Jaccard coefficient with a smaller variance than MinHash for the same signature size. Experimental results showed that the Mean Squared Error (MSE) of the proposed method was around one eighth of that of MinHash. In addition, NDD with the proposed method was more accurate than with MinHash and with BitHash, a more recent method. The best F-measure obtained was 87.05%. Setting the number of sections s to 2 gave the best results on the tested dataset.
Roya Hasanian
Mohammad Javad Kargar
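
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard MinHash estimator of the Jaccard coefficient that the abstract compares against. It is not the proposed S-MinHash, whose sectional data structure is not specified in the abstract. All function names (`shingles`, `minhash_signature`, `estimate_jaccard`, and so on) are hypothetical, and the word-shingle attributes and the linear hash family modulo a Mersenne prime are assumptions chosen only for illustration.

```python
import hashlib
import random

# Assumption: a Mersenne prime used as the modulus of the illustrative hash family.
_PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _base_hash(item):
    """Map a string to an integer in [0, _PRIME) with a stable hash."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % _PRIME


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining the hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Signature = minimum hash value of the set under each hash function."""
    if not items:
        raise ValueError("cannot build a MinHash signature of an empty set")
    base = [_base_hash(x) for x in items]
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(num_hashes=256, seed=42)
    doc_a = shingles("the quick brown fox jumps over the lazy dog near the river")
    doc_b = shingles("the quick brown fox jumps over the lazy cat near the river")
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact Jaccard:   ", jaccard(doc_a, doc_b))
    print("MinHash estimate:", estimate_jaccard(sig_a, sig_b))
```

With a signature of k hash functions, the variance of this baseline estimate is roughly J(1 - J)/k for true Jaccard coefficient J, which is the quantity the abstract reports S-MinHash as reducing for the same signature size.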