
Most existing ways to algorithmically evaluate the similarity of documents depend on shallow measures: does this word seem important? What words is it used with? How frequent are they? That's why this paper is cool: the authors compare the language in a given document against broader knowledge of words and how they relate to one another:

In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents as well as the Google n-gram corpus as an additional semantic similarity source.

And it works!

The experimental results demonstrate that the proposed measure improves significantly the quality of document clustering, based on statistical tests. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
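
To make the "combining" idea concrete, here is a rough Python sketch of blending a bag-of-words score with a corpus-based semantic score. This is not the paper's Frequency Google Tri-gram Measure. The co-occurrence counts are made up and stand in for a real n-gram corpus, and the relatedness formula, the `alpha` weight, and every function name are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy counts standing in for an n-gram corpus.
# All numbers are invented for illustration only.
TOY_UNIGRAM = {"car": 100, "automobile": 50, "road": 120, "highway": 60}
TOY_COOCCURRENCE = {
    frozenset(("car", "automobile")): 30,
    frozenset(("road", "highway")): 24,
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def tf_cosine(a_tokens, b_tokens):
    """Cosine similarity between raw term-frequency vectors (bag-of-words)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def word_relatedness(w1, w2):
    """Assumed relatedness: co-occurrence count over the rarer word's count."""
    if w1 == w2:
        return 1.0
    co = TOY_COOCCURRENCE.get(frozenset((w1, w2)), 0)
    rarer = min(TOY_UNIGRAM.get(w1, 0), TOY_UNIGRAM.get(w2, 0))
    if co == 0 or rarer == 0:
        return 0.0
    return min(1.0, co / rarer)


def semantic_similarity(a_tokens, b_tokens):
    """Average best-match relatedness, computed in both directions."""
    def directed(src, dst):
        if not src or not dst:
            return 0.0
        return sum(max(word_relatedness(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(a_tokens, b_tokens) + directed(b_tokens, a_tokens))


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Weighted blend: alpha * bag-of-words + (1 - alpha) * semantic."""
    ta, tb = tokenize(doc_a), tokenize(doc_b)
    return alpha * tf_cosine(ta, tb) + (1 - alpha) * semantic_similarity(ta, tb)


if __name__ == "__main__":
    a, b = "car road", "automobile highway"
    print("term frequency only:", tf_cosine(tokenize(a), tokenize(b)))
    print("combined:", combined_similarity(a, b))
```

The two example documents share no words, so the term-frequency score alone is zero, while the combined score is positive because the toy counts say "car" and "automobile" tend to appear together. That is the gap a semantic source is meant to fill. A real version would swap the toy tables for actual n-gram counts and feed the resulting similarity matrix into a clustering algorithm.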
