Existing options for algorithmically evaluating the similarity of documents depend on shallow measures: does this word seem important? What words is it used with? How frequent are they? That is why this paper is cool: the authors compare the language in a given document with broader knowledge of words and their synonyms:

In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents as well as the Google n-gram corpus as an additional semantic similarity source.

And it works!

The experimental results demonstrate that the proposed measure improves significantly the quality of document clustering, based on statistical tests. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
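
To make the idea of "combining bag-of-words and semantic similarity" concrete, here is a minimal sketch. It is not the paper's Frequency Google Tri-gram Measure: the `RELATEDNESS` table is a hand-made stand-in for scores that would come from the Google n-gram corpus, and the function names, the best-match averaging, and the `alpha` weighting are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical word-relatedness scores (made-up values, NOT derived from the
# Google n-gram corpus). Keys are alphabetically sorted word pairs.
RELATEDNESS = {
    ("automobile", "car"): 0.9,
    ("engine", "motor"): 0.8,
}


def word_relatedness(a, b):
    """Look up relatedness of two words; identical words score 1.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(tuple(sorted((a, b))), 0.0)


def tf_cosine(doc_a, doc_b):
    """Plain bag-of-words cosine similarity over raw term counts."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def semantic_similarity(doc_a, doc_b):
    """Frequency-weighted average of each term's best match in the other
    document, averaged over both directions (an assumed scheme)."""

    def directed(src, dst):
        counts, vocab = Counter(src), set(dst)
        if not counts or not vocab:
            return 0.0
        total = sum(counts.values())
        best = sum(
            n * max(word_relatedness(w, v) for v in vocab)
            for w, n in counts.items()
        )
        return best / total

    return 0.5 * (directed(doc_a, doc_b) + directed(doc_b, doc_a))


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Weighted blend of the two scores; alpha is an assumed tuning knob."""
    return alpha * tf_cosine(doc_a, doc_b) + (1 - alpha) * semantic_similarity(
        doc_a, doc_b
    )


doc_a = ["car", "engine", "repair"]
doc_b = ["automobile", "motor", "repair"]
print(tf_cosine(doc_a, doc_b))
print(semantic_similarity(doc_a, doc_b))
print(combined_similarity(doc_a, doc_b))
```

The two toy documents share only one literal term, so the bag-of-words score is low, while the relatedness table lets the semantic score recognise that "car" and "automobile" point at the same thing. Blending the two is the general shape of what the quoted result describes.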
