Tags: highlights, automation
Link: http://link.springer.com/10.1007/s10115-018-1278-7
Summary: Existing options for algorithmic evaluation of the similarity of documents are typically limited to only the vocabulary in the documents. This new approach leverages Google’s broader knowledge of language to connect terms in a given document to the vocabulary it’s normally used with. Very cool!
Twitter: A new approach to text analysis that helps link a given document’s vocabulary with related terms that aren’t contained in it. Neat!
Date: Sat, 30 Nov 2019 10:23:04 -0700
# Combining semantic and term frequency similarities for text clustering
Existing options for algorithmic evaluation of document similarity depend on shallow measures: does this word seem important? What words is it used with? How frequent are they? That's why this paper is cool. The authors compare the language in a given document against broader knowledge of words and their related terms:
In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents as well as the Google n-gram corpus as an additional semantic similarity source.
And it works!
The experimental results demonstrate that the proposed measure improves significantly the quality of document clustering, based on statistical tests. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
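
To make the general idea concrete, here is a rough Python sketch of blending a bag-of-words score with a semantic one. It is not the paper's Frequency Google Tri-gram Measure: the relatedness table, the best-match averaging, and the `weight` parameter are all assumptions invented for illustration, standing in for scores that would really be derived from the Google n-gram corpus.

```python
import math
from collections import Counter

# Hypothetical relatedness scores standing in for values derived from an
# n-gram corpus. These numbers are invented for illustration only.
TOY_RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "engine")): 0.6,
    frozenset(("automobile", "engine")): 0.6,
}


def relatedness(a, b):
    """Return 1.0 for identical words, otherwise look up the toy table."""
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def bow_cosine(doc_a, doc_b):
    """Cosine similarity of raw term-frequency vectors (vocabulary only)."""
    fa, fb = Counter(doc_a), Counter(doc_b)
    dot = sum(fa[t] * fb[t] for t in fa)
    norm_a = math.sqrt(sum(v * v for v in fa.values()))
    norm_b = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_sim(doc_a, doc_b):
    """Average best-match relatedness between the two documents' words."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(s, d) for d in dst) for s in src) / len(src)

    # Average both directions so the score is symmetric.
    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


def combined_sim(doc_a, doc_b, weight=0.5):
    """Weighted blend of the two scores (an assumed way to combine them)."""
    return weight * bow_cosine(doc_a, doc_b) + (1 - weight) * semantic_sim(doc_a, doc_b)


doc1 = ["car", "engine"]
doc2 = ["automobile", "engine"]
print(bow_cosine(doc1, doc2), semantic_sim(doc1, doc2), combined_sim(doc1, doc2))
```

In this toy example the two documents share only one word, so the bag-of-words score alone undersells how close they are. The semantic term picks up that "car" and "automobile" are related even though neither document contains both.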