Anonymous ID: dbb4a4 Feb. 25, 2018, 8:59 p.m. No.498755   >>9341

>>495005

I've been working on exactly this. I'm pulling the catalog from ga & qresearch. Finding the research general threads and saving those with q posts. Only goes back to about 2/15 when I turned the machine on. Currently working on getting old posts reconstructed. 99% sure I can grab all breads from 8ch.

 

C# dll to scrape Q posts and threads from 8ch. Output is in 8ch+ JSON format, but it could be serialized as XML I guess.

Anonymous ID: dbb4a4 Feb. 28, 2018, 2:24 p.m. No.520068   >>0151

>>499327

>>499341

 

OK So I think I've got my chanscraper console app working as designed.

 

AFAIK, I've got all the QPosts in a single JSON, and I've got complete breads starting with Bread #364 2018-02-07. That's as far back as I've been able to reach programmatically. Each complete bread has also been filtered into another JSON file containing just Q's posts.
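
A minimal Python sketch of that filtering step, not the ChanScraper's actual C# code. It assumes a bread is stored in the raw thread layout {"posts": [...]} and that Q posts can be recognised by the tripcode shown in the sample JSON later in this thread. The function name and the tiny sample bread are made up for illustration, and a real run would likely need every tripcode in use over time.

```python
import json

Q_TRIP = "!UW.yye1fxo"  # tripcode taken from the sample posts in this thread


def filter_q_posts(bread, trip=Q_TRIP):
    """Return only the posts in a bread whose tripcode matches `trip`."""
    return [post for post in bread.get("posts", []) if post.get("trip") == trip]


# Made-up bread in the assumed {"posts": [...]} layout.
bread = {
    "posts": [
        {"no": 1, "name": "Anonymous", "com": "baker checking in"},
        {"no": 2, "name": "Q ", "trip": "!UW.yye1fxo", "com": "example"},
    ]
}

q_posts = filter_q_posts(bread)
print(json.dumps(q_posts, indent=2))
```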

 

The complete breads have only come from 8ch. The chanscraper is set up to where it could scrape 4ch as well - assuming the JSON is still available.

 

I'm showing 825 QPosts - 1 more than qCodeFag because I believe I have a deleted one. All counted it's 210 threads.

 

I've done all the hard work of setting up the old catalog/threads/posts. It's set up so you can specify how far back to refresh (to cut down on unnecessary HTTP GETs). It reads in the existing data, finds the new threads to search for on 8ch/greatawakening and 8ch/qresearch, and then archives the threads/posts that Q has made locally.

 

If anybody wants the full Q archive as I have it now, here it is: 6mb https:// anonfile.com/H6B7G7dcbc/QJsonArchive.zip

 

I'm going to integrate the DJTweets + minute Deltas in this week.

 

Once I get this all cleaned up I'll cut it loose on Github if there are any C#codeFags interested.

 

My idea is to set up a simple HTML page using some JavaScript that can be run locally on a single user's machine or website. Since the scraper is a C# dll it could be set up to run as a timed service on a web server to keep a site up to date.

Anonymous ID: dbb4a4 Feb. 28, 2018, 2:41 p.m. No.520179   >>0263

>>520151

Yeah I knew about that - but I'd already been getting data from QCodeFag. The QCodeFag data was the basis for what I have now since it had already done the scraping on 4ch. I wanted my own C# source going forward that I can use locally with my other C# code.

Anonymous ID: dbb4a4 March 1, 2018, 12:11 p.m. No.527353

>>520237

Here's the archive again + a handy HTML page that you can use in your browser to view the archives locally. Works fine in Chrome and IE. Readme included.

 

https:// anonfile.com/W3f5H6d8be/QJSONArchive.zip

Anonymous ID: dbb4a4 March 1, 2018, 9:04 p.m. No.530677

>>530283

Here's a newer local archive that moves there.

I've put in some UI enhancements to the JSON Viewer HTML page. Seems to be working good. With a slight mod it could work with local json from any QCodeFag site or even direct from 8ch.

https:// anonfile.com/5ercH3d9ba/QJSONArchive_v1.zip

 

Getting the posts into 2 columns should be no problem. It's getting a reliable news source that is gonna cause you trouble.

 

I was planning on putting 3 columns in the viewer: QPosts, Times, DJTweets. In doing all this I've discovered a few things about 8ch/halfchan. The post IDs are not guaranteed unique. The best unique key is time, and even then I've found 2 posts that dropped at the same timestamp. Thematically I've been trying to key everything to time. [qposts, tweets, news]
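
One way to key everything to time, sketched in Python under stated assumptions: every item carries an epoch-seconds "time" field like the 8ch JSON does, and merge_by_time plus the "kind" label are hypothetical names, not part of the ChanScraper. Because Python's sorted() is stable, two items with the same timestamp keep their input order instead of being dropped or shuffled.

```python
def merge_by_time(*streams):
    """Merge labelled item lists into one timeline ordered by epoch time.

    Each stream is a (label, items) pair. sorted() is stable, so items that
    share a timestamp stay in the order they were passed in.
    """
    timeline = []
    for label, items in streams:
        for item in items:
            timeline.append({"kind": label, **item})
    return sorted(timeline, key=lambda entry: entry["time"])


# Illustrative data only; the ids are made up.
qposts = [{"no": 158078, "time": 1514060541}]
tweets = [{"tweetId": 1, "time": 1514060825}, {"tweetId": 2, "time": 1514060541}]

for entry in merge_by_time(("qpost", qposts), ("tweet", tweets)):
    print(entry["time"], entry["kind"])
```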

Anonymous ID: dbb4a4 March 3, 2018, 6:15 p.m. No.543389   >>5176

>>540555

What is everybody using as their sources for drops? 8ch? One of the QCode forks? Something else?

 

How do we verify that our collections are the same?

 

I've been adding a Guid for each post I scrape, just to give them all a unique value.
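
A freshly generated Guid differs on every scraper, so it can't answer the "are our collections the same" question by itself. A rough Python sketch of one alternative: hash the fields every collection should share. It assumes both collections use the same "source" names and keep the 8ch "no" and "time" fields; fingerprint is a hypothetical helper.

```python
import hashlib


def fingerprint(posts):
    """Order-independent SHA-256 over the (source, no, time) of every post."""
    keys = sorted(
        (str(post.get("source")), int(post["no"]), int(post["time"]))
        for post in posts
    )
    digest = hashlib.sha256()
    for key in keys:
        digest.update(repr(key).encode("utf-8"))
    return digest.hexdigest()


# Same two posts, different order and different per-scraper uniqueId values.
mine = [
    {"source": "qresearch", "no": 544985, "time": 1520140647, "uniqueId": "a"},
    {"source": "qresearch", "no": 548166, "time": 1520180387, "uniqueId": "b"},
]
theirs = [
    {"source": "qresearch", "no": 548166, "time": 1520180387, "uniqueId": "c"},
    {"source": "qresearch", "no": 544985, "time": 1520140647, "uniqueId": "d"},
]

print(fingerprint(mine) == fingerprint(theirs))
```

Two anons could post just the hex digest and compare; a mismatch only says the sets differ, not where.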

Anonymous ID: dbb4a4 March 4, 2018, 7:23 a.m. No.547826   >>8084

Phonefag right now.

>>545176

>>547789

There's an md5 field, as you know, in the 8ch JSON, but it wasn't in the data I got from QCodeFag. That's because he'd modified the .com to strip the HTML into a .text field.

 

My chanscraper keeps the md5 and the .com and strips HTML into .text.
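
A minimal sketch of that idea in Python (the ChanScraper itself is C#): leave md5 and com untouched and add a plain-text copy as text. It uses only the standard library html.parser. One line per <p> joined with "\n" is an assumption; the real scraper's whitespace convention may differ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text of each <p> element as one line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._current = []

    def handle_data(self, data):
        self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self.lines.append("".join(self._current))
            self._current = []


def com_to_text(com):
    """Strip HTML from an 8ch `com` field and return plain text."""
    parser = _TextExtractor()
    parser.feed(com or "")
    parser.close()
    if parser._current:  # text that was not wrapped in <p>
        parser.lines.append("".join(parser._current))
    return "\n".join(parser.lines)


def add_text_field(post):
    """Return a copy of the post with `text` added; md5 and com are kept."""
    return {**post, "text": com_to_text(post.get("com"))}


sample_com = (
    '<p class="body-line ltr "><a onclick="highlightReply(\'548157\', event);" '
    'href="/qresearch/res/547414.html#548157">&gt;&gt;548157</a></p>'
    '<p class="body-line ltr ">Also not a real Q post</p>'
    '<p class="body-line ltr ">Q</p>'
)
print(com_to_text(sample_com))
```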

 

Any C#fags here?

I did set up a GitHub yesterday and push the chanscraper out. Gonna get the Twitter stuff mashed in the next few days.

Anonymous ID: dbb4a4 March 4, 2018, 8:08 a.m. No.548084   >>8229

>>547826

Just ran my chanscraper again since apparently there were new posts last night as I was jacking around with Github.

 

I checked my posts with what's on qresearch and I think I'm good. Showing 839 total now.

New Q posts from 828 - 839.

 

I found a bug in the ChanScraper code too. A thing I've been working on that I forgot to remove. I'll push it out too and then link the GitHub.

Anonymous ID: dbb4a4 March 4, 2018, 8:26 a.m. No.548229   >>8433 >>9586

>>548084

Here's the link to my new GitHub

https:// github.com/QCodeFagNet/SFW.ChanScraper

 

If you are going to run the ChanScraper and then view the posts locally, when you open the QJSONViewer.html page, don't open the [json_allQPosts.json] file, open the newly generated [bin\json_allQPosts.json] file.

 

The machine needed me to include all the existing posts/work json. It's kind of clunky the way I'm doing it because I want to keep this updated with the latest posts/work json. But for a normal user everything is kept updated automagically in the bin\json folders. The project is set up to copy new files if newer - so everything should be kept in sync.

 

If you are planning on running this locally you'll need the .NET framework 4.5 at least. Probably better to go with 4.5.2

https:// www.microsoft.com/net/download/dotnet-framework-runtime/net452

Anonymous ID: dbb4a4 March 4, 2018, 12:29 p.m. No.550148   >>0218 >>0251 >>1411

>>549377

Tedious Dayum. Think you could convert your full bread scrape into some json?

 

>>549586

Gotta link to one of the JSON files?

 

>>548564

Here's a mini local JSON viewer as an HTML page + allQPosts.json. @225KB

 

Includes all QPosts up to 2018-03-04T11:29:14

 

https:// anonfile.com/06HeJbdeb6/Mini_Local_JSONViewer.zip

 

I was just thinking that what we really need to start off with is a single schema that we can all agree on. It will go a long way toward interoperability.

 

I'm going to run some tests on my local QCodeFag install and see if it will work off of the ChanScraper _allQPosts.json file. I think it should.

 

The JSONViewer could work with straight files from 8ch or 4ch with a single minor change I forgot to put in.

Anonymous ID: dbb4a4 March 4, 2018, 12:33 p.m. No.550167

>>549586

The ChanScraper includes the full JSON archive as of this morning. I haven't needed to go back to any archive.is HTML archives because I've been collecting breads locally since the beginning of Feb. All the Q posts before that I sourced from the QCodeFag forks.

Anonymous ID: dbb4a4 March 4, 2018, 12:42 p.m. No.550218

>>550148

Here's what the JSON schema I'm working with looks like.

 

[

{

"source": "qresearch",

"threadId": 544266,

"link": "https:// 8ch.net/qresearch/res/544266.html#544985",

"imageLinks": [

{

"url": "https:// media.8ch.net/file_store/ffd6128f5949e4d4f6f3480236a63be002ffc5e59c0a31714360624d8ce45170.jpeg"

},

{

"url": "https:// media.8ch.net/file_store/ffd6128f5949e4d4f6f3480236a63be002ffc5e59c0a31714360624d8ce45170.jpeg/B42CA278-6C32-4618-A856-0CB9B680CC38.jpeg"

}

],

"references": [

{

"source": "qresearch",

"threadId": 0,

"link": "https:// 8ch.net/qresearch/res/0.html#548166",

"imageLinks": [],

"references": [],

"no": 548166,

"uniqueId": "19294a1b-8cae-435d-9503-8eb70c573d6b",

"_unixEpoch": "1970-01-01T00:00:00Z",

"text": "\r\r>>548157\r\rAlso not a real Q post\r\rQ",

"postDate": "2018-03-04T11:19:47",

"time": 1520180387,

"tn_h": 0,

"tn_w": 0,

"h": 0,

"w": 0,

"tim": null,

"fsize": 0,

"filename": null,

"ext": null,

"md5": null,

"last_modified": 1520180387,

"sub": null,

"com": "<p class=\"body-line ltr \"><a onclick=\"highlightReply('548157', event);\" href=\"/qresearch/res/547414.html#548157\">&gt;&gt;548157</a></p><p class=\"body-line ltr \">Also not a real Q post</p><p class=\"body-line ltr \">Q</p>",

"name": "Q ",

"trip": "!UW.yye1fxo",

"replies": 0

}

],

"no": 544985,

"uniqueId": "35c759aa-4998-4009-83a7-2af1b3273f28",

"_unixEpoch": "1970-01-01T00:00:00Z",

"text": "\r\r>>548166\r\rNOT A REAL Q POST\r\rQ",

"postDate": "2018-03-04T00:17:27",

"time": 1520140647,

"tn_h": 237,

"tn_w": 255,

"h": 1114,

"w": 1200,

"tim": "ffd6128f5949e4d4f6f3480236a63be002ffc5e59c0a31714360624d8ce45170",

"fsize": 271479,

"filename": "B42CA278-6C32-4618-A856-0CB9B680CC38",

"ext": ".jpeg",

"md5": "CbsCGk0pVEahunzSuV4LKw==",

"last_modified": 1520140647,

"sub": null,

"com": "<p class=\"body-line ltr \"><a onclick=\"highlightReply('548166', event);\" href=\"/qresearch/res/547414.html#548166\">&gt;&gt;548166</a></p><p class=\"body-line ltr \">NOT A REAL Q POST.</p><p class=\"body-line ltr \">Q</p>",

"name": "Q ",

"trip": "!UW.yye1fxo",

"replies": 0

}

]
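
In both sample posts above, postDate is the epoch "time" shifted by five hours behind UTC. A small Python sketch of that relationship, assuming a fixed UTC-5 offset (an inference from these samples only; posts made under daylight saving time would need a real time zone, and post_date_from_epoch is a hypothetical helper):

```python
from datetime import datetime, timedelta, timezone


def post_date_from_epoch(epoch_seconds, utc_offset_hours=-5):
    """Format an epoch `time` value the way `postDate` appears in the samples."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(epoch_seconds, tz=tz)
    # Drop the offset so the string matches the zone-less postDate format.
    return local.replace(tzinfo=None).isoformat()


print(post_date_from_epoch(1520180387))  # 2018-03-04T11:19:47
print(post_date_from_epoch(1520140647))  # 2018-03-04T00:17:27
```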

Anonymous ID: dbb4a4 March 4, 2018, 6:30 p.m. No.553092

>>551411

Yeah I've dug thru all the HTML looking for a reference to a JSON file. Can't find a reference to one either. My guess is that once it drops off the main thread catalog, the JSON is no longer available. Too bad, because that's the meat in a simple format.

 

No, the machine is more of a scraper (grab data and save it) than a parser. It does parse the HTML out of the .com field into .text like QCodeFag does though. It's not designed to read thru HTML pages to look for posts.

 

It has a local baseline archive of everything. It reads in that entire local archive and then figures out the JSON breads it needs to download from the 8ch/qresearch/catalog.json. Then it downloads all those new breads and resets itself so you don't download everything every time - only the breads from the past [x] days.
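
A rough Python sketch of that selection step, not the actual C# logic. It assumes catalog.json is a list of pages that each hold a "threads" list with "no" and "last_modified" fields, and that a bread is worth fetching if it isn't archived yet or was modified inside the refresh window. The function and parameter names are hypothetical.

```python
import time


def threads_to_fetch(catalog, known_thread_ids, days_back, now=None):
    """Pick thread numbers to download from an already-parsed catalog.

    catalog: list of pages, each {"threads": [{"no": ..., "last_modified": ...}]}
    known_thread_ids: set of thread numbers already in the local archive
    days_back: refresh window in days
    """
    now = time.time() if now is None else now
    cutoff = now - days_back * 86400
    wanted = []
    for page in catalog:
        for thread in page.get("threads", []):
            is_new = thread["no"] not in known_thread_ids
            is_recent = thread.get("last_modified", 0) >= cutoff
            if is_new or is_recent:
                wanted.append(thread["no"])
    return wanted


# Made-up catalog data for illustration.
catalog = [
    {"page": 0, "threads": [{"no": 100, "last_modified": 1000000},
                            {"no": 101, "last_modified": 900000}]},
    {"page": 1, "threads": [{"no": 102, "last_modified": 100}]},
]
print(threads_to_fetch(catalog, {100, 101}, days_back=1, now=1000000))
```

Here thread 101 is skipped because it is already archived and hasn't changed inside the one-day window, which is what keeps the HTTP GETs down.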

Anonymous ID: dbb4a4 March 4, 2018, 8:30 p.m. No.554074

Here's an updated mini local JSON viewer as an HTML page + allQPosts.json. @225KB

I updated it so it works with the raw json from 8ch.

https:// 8ch.net/qresearch/res/553655.json

Could probably use an [ascending/descending] button but…

 

Includes all QPosts up to 2018-03-04T11:29:14

 

https:// anonfile.com/z4U1Jdd9b9/Mini_Local_JSONViewer.zip

 

If folks don't like a zip, it's only 2 files: they can download the HTML file (ChanScraper) and the allQPosts.json (Console\bin) file on GitHub https:// github.com/QCodeFagNet/SFW.ChanScraper

Anonymous ID: dbb4a4 March 5, 2018, 3:27 p.m. No.560076   >>0415 >>4762

>>555095

holey phuck. 193 GB. That's for a full archive of all breads + images? My local scrape of Q breads and posts as text only comes in at 6mb. My local QCodeFag install with text + Q images is just under 100mb.

 

193GB is getting unmanageable.

Anonymous ID: dbb4a4 March 6, 2018, 9:07 a.m. No.568187   >>8666 >>9170

>>564762

Yeah it's not totally unmanageable. It's more like moving a full grown oak tree. You can do it, but it's a huge pain in the ass. I was thinking more in terms of moving it around the internet or hosting. That's a pretty big db.

 

I rejiggered the ChanScraper to archive all the breads even if there isn't a Q post in that bread. It rendered 215 NEW complete breads and brought my net JSON file size from 6MB to 200MB. Starts around "Q Research General #358".

 

That's with no images, just the raw JSON from 8ch. Each bread is around 700kb.

Anonymous ID: dbb4a4 March 6, 2018, 1:16 p.m. No.570604   >>5010

I've rejiggered the ChanScraper to produce TwitterSmashed JSON. It includes any DJTweets within 60 mins of a QPost. Here's what [5], [8], [10] deltas look like.

 

{

"DJTtwitterPosts": [

{

"accountId": "realDonaldTrump",

"accountName": "Donald J. Trump",

"tweetId": 944665687292817415,

"text": "How can FBI Deputy Director Andrew McCabe, the man in charge, along with leakin’ James Comey, of the Phony Hillary Clinton investigation (including her 33,000 illegally deleted emails) be given $700,000 for wife’s campaign by Clinton Puppets during investigation?",

"delta": 5,

"link": "https:// twitter.com/realDonaldTrump/status/944665687292817415",

"uniqueId": "00e6951d-5f49-455b-bdd9-bda7f184d9c7",

"time": 1514060825,

"_unixEpoch": "1970-01-01T00:00:00Z",

"postDate": "2017-12-23T15:27:05"

},

{

"accountId": "realDonaldTrump",

"accountName": "Donald J. Trump",

"tweetId": 944666448185692166,

"text": "FBI Deputy Director Andrew McCabe is racing the clock to retire with full benefits. 90 days to go?!!!",

"delta": 8,

"link": "https:// twitter.com/realDonaldTrump/status/944666448185692166",

"uniqueId": "92fbb1a2-169e-412c-abba-6e441d3acbaa",

"time": 1514061006,

"_unixEpoch": "1970-01-01T00:00:00Z",

"postDate": "2017-12-23T15:30:06"

},

{

"accountId": "realDonaldTrump",

"accountName": "Donald J. Trump",

"tweetId": 944667102312566784,

"text": "Wow, “FBI lawyer James Baker reassigned,” according to @FoxNews.",

"delta": 10,

"link": "https:// twitter.com/realDonaldTrump/status/944667102312566784",

"uniqueId": "eabb202f-3b59-48c9-b282-f0110b8388a5",

"time": 1514061162,

"_unixEpoch": "1970-01-01T00:00:00Z",

"postDate": "2017-12-23T15:32:42"

}

],

"no": 158078,

"name": "Q",

"trip": "!UW.yye1fxo",

"sub": null,

"com": null,

"text": "SEARCH crumbs: [#2]\nWho is #2?\nNo deals.\nQ\n",

"tim": null,

"fsize": 0,

"filename": null,

"ext": null,

"tn_h": 0,

"tn_w": 0,

"h": 0,

"w": 0,

"replies": 0,

"md5": null,

"last_modified": 0,

"source": "8chan_cbts",

"threadId": 157461,

"link": "https:// 8ch.net/cbts/res/157461.html#158078",

"imageLinks": [],

"references": [],

"uniqueId": "e22306cc-2831-453a-ae1d-16e90aa23707",

"time": 1514060541,

"_unixEpoch": "1970-01-01T00:00:00Z",

"postDate": "2017-12-23T15:22:21"

}
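
The deltas in the sample above are consistent with the gap between the tweet and the Q post in minutes, rounded to the nearest whole minute (284 s, 465 s and 621 s give 5, 8 and 10). A minimal Python sketch under that assumption; counting tweets on both sides of the Q post inside the 60 minute window is also an assumption, and tweets_with_deltas is a hypothetical name rather than the ChanScraper's own code.

```python
def tweets_with_deltas(q_post, tweets, window_minutes=60):
    """Attach a minute delta to every tweet within the window of a Q post."""
    matched = []
    for tweet in tweets:
        seconds = tweet["time"] - q_post["time"]
        if abs(seconds) <= window_minutes * 60:
            # A negative delta would mean the tweet came before the Q post.
            matched.append({**tweet, "delta": round(seconds / 60)})
    return sorted(matched, key=lambda t: t["time"])


# Epoch times copied from the sample JSON above; the last tweet is made up
# to show one falling outside the window.
q_post = {"no": 158078, "time": 1514060541}
tweets = [
    {"tweetId": 944665687292817415, "time": 1514060825},
    {"tweetId": 944666448185692166, "time": 1514061006},
    {"tweetId": 944667102312566784, "time": 1514061162},
    {"tweetId": 0, "time": 1514060541 + 2 * 3600},
]

print([t["delta"] for t in tweets_with_deltas(q_post, tweets)])  # [5, 8, 10]
```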