Code at github.com/anonsw/qtmerge does some similar things. Check it out; there may be some useful ideas to lift from there: anonsw.github.io
qtmerge uses the raw JSON/HTML data, where relevant, from 8ch, 4plebs and trumptwitterarchive as its source data. It also merges in the JSON from qcodefag/qanonmap. It currently uses the host, board, post timestamp and post number to sync.
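As a rough illustration of syncing on host, board, post timestamp and post number, here is a minimal sketch. It is not taken from the qtmerge code: the field names (`host`, `board`, `timestamp`, `no`) and the helpers `PostKey`, `make_key` and `merge_sources` are hypothetical, picked only to show the idea of a composite key.

```python
from typing import NamedTuple


class PostKey(NamedTuple):
    """Composite key identifying one post across sources (hypothetical)."""
    host: str
    board: str
    timestamp: int  # Unix time in seconds
    post_no: int


def make_key(post: dict) -> PostKey:
    # Field names are assumptions; real source data may name them differently.
    return PostKey(
        host=post["host"].lower(),
        board=post["board"].strip("/").lower(),
        timestamp=int(post["timestamp"]),
        post_no=int(post["no"]),
    )


def merge_sources(*sources):
    """Merge posts from several sources; later sources overwrite shared fields."""
    merged = {}
    for source in sources:
        for post in source:
            merged.setdefault(make_key(post), {}).update(post)
    return merged
```

Normalizing case and slashes before building the key means the same post from two archives lands in one record even if the archives format the board name differently.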
I like the idea of matching the GUIDs along with a post hash using some method we agree on.
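One possible way to do the post hash, offered only as a starting point for discussion: normalize the post text, then take SHA-256 of the result. The normalization rules here (Unicode NFC plus whitespace collapsing) and the function name `post_hash` are my assumptions, not an agreed method.

```python
import hashlib
import unicodedata


def post_hash(text: str) -> str:
    """Hash a post's text after light normalization (assumed rules)."""
    # Normalize Unicode so visually identical text hashes the same.
    normalized = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace, since archives differ in line breaks.
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Whatever rules get chosen, everyone has to apply exactly the same ones, otherwise the hashes won't match across datasets.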
Yes, unoptimized and incomplete.
I did some research on collecting the CBTS threads from 4chan/pol the other night and the results might be useful for others. They can be found at the bottom of the page here:
https:// anonsw.github.io/qtmerge/catalog.html
It's still a work in progress.
Excellent. Will that raw JSON data be in the DB as well?
I did notice, those are great ideas. Can I suggest letting each user have their own copy/edits of the metadata? The user-specific data could then feed back into the system as suggestions to others, etc. But primarily it gives the user some way to control the interference/noise.
Thanks, 4plebs is good for now, but a second witness is preferable. Started archiving Feb 15, but some old data was still available at the time.
For 8ch these are the oldest breads I have:
pol: 10509790 (2017-08-28)
cbts: 10 (2017-11-21)
thestorm: 1 (2018-01-31)
I don't have all the breads after those, though; the archive is incomplete.
I've since stopped archiving pol/cbts/thestorm to save time/space.
Below is the qtmerge modified raw dataset (text-only) as of 2018-03-14 02:07 UTC.
I'm putting this out in the hopes that it may be useful to others for ETL, mining, search tools, archiving etc.
Some notes:
- The data is a synthesis of the qtmerge datasets: https:// anonsw.github.io/qtmerge/datasets.html
- For an idea of the threads that are available, see: https:// anonsw.github.io/qtmerge/catalog.html
- The eventcache.json file contains the posts/tweets/etc. in chronological order. The type attribute currently dictates the local object structure (I'm working on making this cleaner).
- refcache.json contains the detected post cross-references (this is a work in progress).
- The referenceID attribute is the "primary key" between the files.
- Timestamps are Unix time, and time strings are US Eastern.
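To give an idea of how the two files might be tied together on referenceID, here is a minimal sketch. It assumes each file is a JSON array of objects, that every object carries a `referenceID` key, and that events carry a Unix `timestamp` key. Only `referenceID` and the file names come from the notes above; the rest is a guess at the layout, so check it against the actual files.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def load_json(path):
    """Load one of the cache files, e.g. eventcache.json or refcache.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def index_events(events):
    """Map referenceID -> event object (assumes IDs are unique per event)."""
    return {event["referenceID"]: event for event in events}


def group_refs(refs):
    """Map referenceID -> list of cross-reference records (assumed layout)."""
    grouped = defaultdict(list)
    for ref in refs:
        grouped[ref["referenceID"]].append(ref)
    return grouped


def to_utc(unix_time):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_time, tz=timezone.utc)
```

Sorting and comparing on the Unix timestamps rather than the time strings avoids having to deal with the US Eastern offset and daylight saving changes.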
Extracted size: ~850 MiB
SHA-256 sum: d6ed89da05c0b714fc66b04ca66a8d701456d882d5f128ee1cef26c8d2e22eb6
http:// anonfile.com/dazfO8d4ba/qtmerge-text-2018-03-15_05.18.37.tar.bz2
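To check the download against the sum above, something along these lines should work. It assumes the SHA-256 sum refers to the downloaded .tar.bz2 archive rather than the extracted data; the helper name `sha256_file` is mine.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a large archive never sits fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the return value of `sha256_file("qtmerge-text-2018-03-15_05.18.37.tar.bz2")` with the posted sum; if they differ, the download is corrupt or the file has been changed.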