Anonymous ID: 98bd4e March 3, 2018, 9:32 p.m. No.545176   >>7789 >>7826

>>543389

qtmerge uses the raw JSON/HTML data, where relevant, from 8ch, 4plebs and trumptwitterarchive as its source data. It also merges in the JSON from qcodefag/qanonmap. It currently uses the host, board, post timestamp and post number to sync.
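
To make the sync idea concrete, here is a minimal sketch of a composite key built from host, board, post timestamp and post number. This is not qtmerge's actual code; the names `PostKey`, `make_key`, `merge_by_key` and the dict fields (`host`, `board`, `timestamp`, `no`) are assumptions made up for illustration.

```python
from typing import NamedTuple


class PostKey(NamedTuple):
    """Hypothetical composite key: host, board, post timestamp, post number."""
    host: str
    board: str
    timestamp: int
    post_no: int


def make_key(host, board, timestamp, post_no):
    # Normalize so that " 8CH.net" / "/qresearch/" and "8ch.net" / "qresearch"
    # produce the same key.
    return PostKey(
        host.strip().lower(),
        board.strip().strip("/").lower(),
        int(timestamp),
        int(post_no),
    )


def merge_by_key(*sources):
    # Each source is an iterable of dicts; the field names are assumptions.
    # Records that share a key are folded into one dict.
    merged = {}
    for source in sources:
        for post in source:
            key = make_key(post["host"], post["board"], post["timestamp"], post["no"])
            merged.setdefault(key, {}).update(post)
    return merged
```

Keying on all four fields means two copies of the same post from different archives collapse into one record, while posts that merely share a number on different boards stay separate.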

 

I like the idea of matching the GUIDs along with a post hash using some method we agree on.
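
One possible shape for the post hash, purely as a starting point for agreeing on a method: normalize the text, then take a SHA-256 digest. The normalization rules here (Unicode NFC, collapsing runs of whitespace) and the function name `post_hash` are assumptions, not anything already agreed on.

```python
import hashlib
import unicodedata


def post_hash(text):
    """Return a hex SHA-256 digest of the post text after light normalization."""
    # Assumed normalization: Unicode NFC, then collapse all whitespace runs
    # (including newlines) to single spaces and trim the ends.
    normalized = unicodedata.normalize("NFC", text)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Whatever normalization is chosen has to be identical on every side, otherwise the same post scraped from HTML and from JSON will hash differently.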

Anonymous ID: 98bd4e March 4, 2018, 12:48 p.m. No.550251   >>3109

>>550148

Let me clarify: HTML for just the archive pages (to capture threads not in catalog/threads.json), JSON for everything else.

 

I'm working on how to share it; it is currently unoptimized and around 6 GiB uncompressed.

Anonymous ID: 98bd4e March 6, 2018, 9:44 a.m. No.568666   >>8861 >>9061

>>568187

I did some research on collecting the CBTS threads from 4chan/pol the other night and the results might be useful for others. They can be found at the bottom of the page here:

 

https:// anonsw.github.io/qtmerge/catalog.html

 

It's still a work in progress.

Anonymous ID: 98bd4e March 6, 2018, 12:10 p.m. No.570074   >>0566 >>0766 >>0809

>>569793

Excellent. Will that raw JSON data be in the DB as well?

 

>>569900

I did notice, those are great ideas. Can I suggest letting each user have their own copy/edits of the metadata? The user-specific data could then feed back into the system as suggestions to others, etc. But primarily it gives the user some way to control the interference/noise.

Anonymous ID: 98bd4e March 6, 2018, 1:44 p.m. No.570874   >>0983

>>570660

Thanks, 4plebs is good for now, but a second witness is preferable. Started archiving Feb 15, but some old data was still available at the time.

 

For 8ch these are the oldest breads I have:

 

pol: 10509790 (2017-08-28)

cbts: 10 (2017-11-21)

thestorm: 1 (2018-01-31)

 

I don't have all the breads after those though; the set is incomplete.

 

I've since stopped archiving pol/cbts/thestorm to save time/space.

Anonymous ID: 98bd4e March 14, 2018, 10:41 p.m. No.670526   >>2344

Below is the qtmerge modified raw dataset (text-only) as of 2018-03-14 02:07 UTC.

 

I'm putting this out in the hopes that it may be useful to others for ETL, mining, search tools, archiving etc.

 

Some notes:

  • The data is a synthesis of the qtmerge datasets: https:// anonsw.github.io/qtmerge/datasets.html

  • For an idea of threads that are available see: https:// anonsw.github.io/qtmerge/catalog.html

  • The eventcache.json file contains the posts/tweets/etc. in chronological order. The type attribute currently dictates the local object structure (I'm working on making this cleaner)

  • refcache.json contains the detected post cross references (this is a work in progress)

  • The referenceID attribute is the "primary key" between the files

  • Timestamps are Unix Time and time strings are US Eastern
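
For anyone doing ETL on this, the notes above can be sketched roughly as follows. The only things taken from the notes are the `referenceID` attribute and the fact that timestamps are Unix time. Everything else is an assumption: that the events and references can be loaded (e.g. with `json.load`) into lists of dicts, and the helper names `index_events`, `attach_refs` and `unix_to_utc`. The real object layout depends on the type attribute, so adjust as needed.

```python
from collections import defaultdict
from datetime import datetime, timezone


def index_events(events):
    """Map referenceID -> event for every event that carries one."""
    return {e["referenceID"]: e for e in events if "referenceID" in e}


def attach_refs(events, refs):
    """Pair each event with the cross references that share its referenceID."""
    by_id = index_events(events)
    grouped = defaultdict(list)
    for ref in refs:
        rid = ref.get("referenceID")
        if rid in by_id:  # skip references to events we do not have
            grouped[rid].append(ref)
    return {rid: (event, grouped.get(rid, [])) for rid, event in by_id.items()}


def unix_to_utc(ts):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)
```

The sketch converts Unix time to UTC only; the US Eastern time strings are left alone, since parsing them depends on their exact format.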

 

Extracted size: ~850 MiB

SHA-256 sum: d6ed89da05c0b714fc66b04ca66a8d701456d882d5f128ee1cef26c8d2e22eb6

 

http:// anonfile.com/dazfO8d4ba/qtmerge-text-2018-03-15_05.18.37.tar.bz2
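
A minimal sketch for checking the download against the SHA-256 sum above, assuming the sum applies to the .tar.bz2 file itself rather than to the extracted data. The helper name `sha256_of_file` and the chunk size are arbitrary choices.

```python
import hashlib

# Sum quoted in the post above.
EXPECTED = "d6ed89da05c0b714fc66b04ca66a8d701456d882d5f128ee1cef26c8d2e22eb6"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large archive does not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be along the lines of `sha256_of_file("qtmerge-text-2018-03-15_05.18.37.tar.bz2") == EXPECTED`, with the path pointing at wherever the file was saved.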