Existing options for algorithmically evaluating the similarity of documents depend on shallow measures: does this word seem important? What words is it used with? How frequent are they? That is what makes this paper cool. The authors compare the language in a given document with broader knowledge of words and their synonyms:
In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents as well as the Google n-gram corpus as an additional semantic similarity source.
And it works!
The experimental results demonstrate that the proposed measure improves significantly the quality of document clustering, based on statistical tests. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
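To make the idea of combining the two signals concrete, here is a minimal Python sketch. It is not the paper's Frequency Google Tri-gram Measure: the `RELATEDNESS` table, the best-match scoring in `semantic_similarity`, and the `alpha` weight are all assumptions made up for illustration. In a real system the word-relatedness scores would come from a large corpus such as the Google n-grams, not from a hand-written dictionary.

```python
import math
from collections import Counter

# Hypothetical word-relatedness scores in [0, 1]. These values are invented
# for illustration; they stand in for scores derived from a large corpus.
RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("doctor", "physician")): 0.85,
}


def word_relatedness(a, b):
    """Return 1.0 for identical words, else a table lookup (0.0 if unknown)."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def bow_cosine(doc_a, doc_b):
    """Bag-of-words similarity: cosine of the raw term-frequency vectors."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    if not ca or not cb:
        return 0.0
    dot = sum(freq * cb[word] for word, freq in ca.items())
    norm_a = math.sqrt(sum(f * f for f in ca.values()))
    norm_b = math.sqrt(sum(f * f for f in cb.values()))
    return dot / (norm_a * norm_b)


def _directed(ca, cb):
    """Frequency-weighted average of each word's best match in the other doc."""
    total = sum(ca.values())
    best = sum(
        freq * max(word_relatedness(word, other) for other in cb)
        for word, freq in ca.items()
    )
    return best / total


def semantic_similarity(doc_a, doc_b):
    """Symmetric semantic score: average of the two directed scores."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    if not ca or not cb:
        return 0.0
    return 0.5 * (_directed(ca, cb) + _directed(cb, ca))


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Linear blend of the two signals; alpha is an assumed tuning weight."""
    return alpha * bow_cosine(doc_a, doc_b) + (1 - alpha) * semantic_similarity(doc_a, doc_b)


if __name__ == "__main__":
    a = "the car is fast".split()
    b = "the automobile is fast".split()
    print(bow_cosine(a, b), semantic_similarity(a, b), combined_similarity(a, b))
```

In this toy example, the bag-of-words score treats "car" and "automobile" as unrelated, while the semantic score gives the pair credit, so the blended score lands between the two. The resulting pairwise similarities could then be fed to any clustering algorithm that accepts a similarity matrix.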