>Has anyone thought to take full news articles and social data dumps, per person, and do sub text matching across the entire body of text to find exact matches?
I think you're misunderstanding my idea. The idea is to identify the sources of narrative scripts being pumped into the public consciousness. Remember when Trump's speech at the '16 RNC was immediately described as "dark" in dozens of articles, tweets, etc? We need to know who's putting out the scripts ("dark") and who's repeating them ("""journalists""" whose bylines are on articles with "dark", shitter users with "dark" in their tweets, etc).
The code could work in different ways but trying to automate everything at the beginning is hard. The easiest way to start would be:
>anon notices a suspicious pattern of the same language being used all of a sudden
<like "dark"
>anon enters the string that's being repeated into a text box
<bonus points if it's pure JS that can run locally rather than requiring a server, at least initially
>code ingests search results of news, shitter, faceblack, etc with that string from the recent past
<configurable in near term increments like past hour, past day, past 2 days
>anon is provided a list of results
From this simple aggregated news & social search, an anon can visually skim the results and easily see how widespread the suddenly repeated language is.
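A rough sketch of the filtering step in plain JS, no server needed. It assumes the search results have already been pulled in somehow (that part is site specific and not shown) and that each result looks like `{ source, author, text, postedAt }`. That shape, the function name `findRepeats`, and the sample rows are all made up for illustration.

```javascript
// Hypothetical result shape: { source, author, text, postedAt }
// postedAt is an ISO 8601 date string. Fetching results is not shown here.

const MS_PER_HOUR = 60 * 60 * 1000;

// Return results that contain `phrase` (case-insensitive substring match)
// and were posted within the last `windowHours`, oldest first.
// Note: a plain substring match on "dark" also matches "darkness".
function findRepeats(results, phrase, windowHours, now = Date.now()) {
  const needle = phrase.toLowerCase();
  const cutoff = now - windowHours * MS_PER_HOUR;
  return results
    .filter((r) => {
      const t = Date.parse(r.postedAt);
      return t >= cutoff && t <= now;
    })
    .filter((r) => r.text.toLowerCase().includes(needle))
    .sort((a, b) => Date.parse(a.postedAt) - Date.parse(b.postedAt));
}

// Made-up sample rows, only to show the call.
const sampleResults = [
  { source: "news", author: "author-a", text: "A dark speech at the convention", postedAt: "2016-07-22T03:00:00Z" },
  { source: "social", author: "user-b", text: "What a DARK vision", postedAt: "2016-07-22T05:30:00Z" },
  { source: "news", author: "author-c", text: "Crowd cheers at the convention", postedAt: "2016-07-22T04:00:00Z" },
  { source: "news", author: "author-d", text: "Dark clouds over the city", postedAt: "2016-07-19T09:00:00Z" },
];

const sampleNow = Date.parse("2016-07-22T12:00:00Z");
const pastDay = findRepeats(sampleResults, "dark", 24, sampleNow);
for (const hit of pastDay) {
  console.log(hit.postedAt, hit.source, hit.author, hit.text);
}
```

The window is given in hours, so "past hour", "past day", and "past 2 days" are just 1, 24, and 48.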
<next features
>let anons select search result items as suspect and enter them into a database that indexes on journalist/author, keyword, etc
>database can use search result item post date to build a timeline, to identify the earliest sources of the narrative script
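The database and timeline features could start as an in-memory index before anyone bothers with real storage. This is only a sketch: the class name `ScriptIndex`, its methods, and the item fields are hypothetical, and the flagged items below are invented placeholders.

```javascript
// Append `value` to the array stored under `key`, creating it if needed.
function addTo(map, key, value) {
  if (!map.has(key)) map.set(key, []);
  map.get(key).push(value);
}

// Hypothetical in-memory index of items an anon has flagged as suspect.
class ScriptIndex {
  constructor() {
    this.byKeyword = new Map(); // keyword -> flagged items
    this.byAuthor = new Map();  // author  -> flagged items
  }

  // Record a flagged item under its keyword and its author.
  flag(item, keyword) {
    const entry = { ...item, keyword: keyword.toLowerCase() };
    addTo(this.byKeyword, entry.keyword, entry);
    addTo(this.byAuthor, entry.author, entry);
  }

  // All flagged items for a keyword, ordered by post date (oldest first).
  timeline(keyword) {
    const items = this.byKeyword.get(keyword.toLowerCase()) || [];
    return [...items].sort(
      (a, b) => Date.parse(a.postedAt) - Date.parse(b.postedAt)
    );
  }

  // The first `n` items on the timeline: candidates for the script's origin.
  earliest(keyword, n = 3) {
    return this.timeline(keyword).slice(0, n);
  }
}

// Made-up flagged items, only to show the calls.
const scriptIndex = new ScriptIndex();
scriptIndex.flag({ author: "author-a", source: "news", postedAt: "2016-07-22T03:00:00Z" }, "dark");
scriptIndex.flag({ author: "user-b", source: "social", postedAt: "2016-07-22T05:30:00Z" }, "Dark");
scriptIndex.flag({ author: "author-e", source: "news", postedAt: "2016-07-22T01:15:00Z" }, "dark");
scriptIndex.flag({ author: "author-a", source: "news", postedAt: "2016-07-23T08:00:00Z" }, "dark");

const firstSeen = scriptIndex.earliest("dark", 1);
console.log(firstSeen[0].author, firstSeen[0].postedAt);
```

The earliest post only shows who was first among the items that got flagged, so the timeline is only as good as the coverage of the search that fed it.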
At this point, with the database populated with common sources of narrative script repeating, it would be pretty doable to automate suspicious pattern detection by ingesting the full body of content from those sources and searching for sub text matches that exceed noise. Like if "the" is used in most of the article headlines and tweets, that doesn't mean shit because "the" is a common word, but if "dark", a much less common word, all of a sudden appears across article headlines and facebook posts, that would be pretty easy to pick up for human review.
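One crude way to sketch "exceeds noise": count how many documents each word appears in for a recent window, compare that to a baseline window, and surface words whose share jumped. This is a swapped-in technique (a document-frequency ratio with add-one smoothing), not the only way to do it. The function names, thresholds, and headlines below are all made up for illustration.

```javascript
// Split text into lowercase words.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Count, for each word, how many documents contain it at least once.
function docFrequency(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const word of new Set(tokenize(doc))) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

// Return words whose share of recent documents is at least `minRatio`
// times their (smoothed) share of baseline documents.
// minRatio and minDocs are arbitrary illustrative thresholds.
function findSpikes(baselineDocs, recentDocs, { minRatio = 3, minDocs = 3 } = {}) {
  const base = docFrequency(baselineDocs);
  const recent = docFrequency(recentDocs);
  const spikes = [];
  for (const [word, n] of recent) {
    if (n < minDocs) continue; // ignore words seen in too few recent docs
    const recentRate = n / recentDocs.length;
    // Add-one smoothing so words unseen in the baseline don't divide by zero.
    const baseRate = ((base.get(word) || 0) + 1) / (baselineDocs.length + 1);
    const ratio = recentRate / baseRate;
    if (ratio >= minRatio) spikes.push({ word, ratio, docs: n });
  }
  return spikes.sort((a, b) => b.ratio - a.ratio);
}

// Made-up headlines, only to show the call.
const baselineHeadlines = [
  "the senate votes on the budget",
  "the team wins the final",
  "the weather turns cold in the north",
  "the market closes higher",
  "the mayor opens the bridge",
  "the council delays the vote",
];
const recentHeadlines = [
  "the speech was dark",
  "a dark vision from the podium",
  "the dark tone of the convention",
  "the crowd cheers",
];

const spikes = findSpikes(baselineHeadlines, recentHeadlines);
for (const s of spikes) {
  console.log(s.word, s.docs, s.ratio.toFixed(2));
}
```

With these samples "the" shows up in every document in both windows, so its ratio stays near 1 and it gets ignored, while "dark" goes from zero baseline documents to three of four recent ones and gets queued for human review.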