Anonymous ID: af3481 Tech/codefaggotry: how to archive 8kun boards July 21, 2020, 10:18 p.m. No.20526

Basic writeup on how to archive 8kun boards to be augmented/discussed further in this thread:

 

Manual archiving (for small boards)

  1. go to the catalog

  2. open each thread in the catalog in a new tab

  3. for each thread, click to expand all images (this part can be done automatically using a script in the browser console)

  4. use the Save Page WE browser extension to grab a complete archive of the thread with all full-size images included in a .html file. I haven't tried to see if this works with video attachments, and I know it won't work with pdfs.

 

Automated archiving

  1. make requests to the catalog

  2. make requests to each individual thread in the catalog, optionally based on the time each thread was last updated

    2a. save the information for each post into a database, and mark posts that have disappeared from the thread as deleted

    2b. look through the media files in each new post and download the full-size version of each (if not already downloaded)

  3. repeat from step 1

  4. periodically review media files that belong to deleted posts and decide whether to keep them. it is hard for a program to tell the difference between content that was deleted by mods and content that disappeared due to site errors.
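
A minimal sketch of steps 1, 2 and 2a above in Python (standard library only), offered as one possible shape rather than a finished tool. Several things here are assumptions and not from this thread: the vichan-style JSON endpoints (/<board>/catalog.json and /<board>/res/<no>.json), the field names no, last_modified and com, and the table layout. Media download (2b) and the review step (4) are left out.

```python
import json
import sqlite3
import time
import urllib.request

BASE = "https://8kun.top"  # assumption: the JSON API is served from the main host


def fetch_json(url):
    """GET a URL and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def open_db(path=":memory:"):
    """Create the two hypothetical tables used by this sketch."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " board TEXT, thread INTEGER, no INTEGER, body TEXT,"
        " deleted INTEGER DEFAULT 0, PRIMARY KEY (board, no))"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        " board TEXT, no INTEGER, last_modified INTEGER,"
        " PRIMARY KEY (board, no))"
    )
    return db


def threads_to_fetch(db, board, catalog):
    """Step 2: keep only threads that are new or updated since the last pass."""
    todo = []
    for page in catalog:  # assumption: catalog is a list of pages
        for t in page.get("threads", []):
            row = db.execute(
                "SELECT last_modified FROM threads WHERE board=? AND no=?",
                (board, t["no"]),
            ).fetchone()
            if row is None or t.get("last_modified", 0) > row[0]:
                todo.append(t)
    return todo


def store_thread(db, board, thread_no, posts, last_modified=0):
    """Step 2a: save live posts, flag stored posts that are no longer served."""
    live = set()
    for p in posts:
        live.add(p["no"])
        db.execute(
            "INSERT OR REPLACE INTO posts (board, thread, no, body, deleted)"
            " VALUES (?, ?, ?, ?, 0)",
            (board, thread_no, p["no"], p.get("com", "")),
        )
    known = [
        r[0]
        for r in db.execute(
            "SELECT no FROM posts WHERE board=? AND thread=?", (board, thread_no)
        )
    ]
    for no in known:
        if no not in live:
            # keep the stored body, only mark the post as gone
            db.execute(
                "UPDATE posts SET deleted=1 WHERE board=? AND no=?", (board, no)
            )
    db.execute(
        "INSERT OR REPLACE INTO threads (board, no, last_modified) VALUES (?, ?, ?)",
        (board, thread_no, last_modified),
    )
    db.commit()


def archive_once(db, board):
    """One pass over steps 1 and 2 (needs network access)."""
    catalog = fetch_json(f"{BASE}/{board}/catalog.json")
    for t in threads_to_fetch(db, board, catalog):
        thread = fetch_json(f"{BASE}/{board}/res/{t['no']}.json")
        store_thread(db, board, t["no"], thread["posts"], t.get("last_modified", 0))
        time.sleep(1)  # small pause between requests
```

Calling archive_once(db, "comms") in a loop with a sleep between passes would cover step 3. A thread that vanishes from the catalog entirely is not handled here; its posts simply stay in the database unflagged.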

kekbees !!!ZDVkOGZmZmJjZDFm ID: 219626 July 22, 2020, 4:17 p.m. No.20569   >>0575

Keeping the points of the manual archiving method in mind:

 

Tried the FireShot plugin in Firefox. It worked as desired for capturing a single complete thread, saved in PNG format.

Tried the MozArchive plugin for storing a single thread. It works as desired in both MAFF and MHT formats.

Saving as PDF in the FireShot plugin did not work correctly; it produced a blank PDF.

 

As long as all the images are expanded before saving, these methods are sufficient for small boards such as mine. Quick and dirty, and it works.

 

For something like /comms/ or the other community boards, the automated archiving avenue would be super beneficial, imo. I'd use it just because Cadillacs are cool, too.

Anonymous ID: af3481 July 22, 2020, 7:05 p.m. No.20576   >>1054

I use an extension called "Save Page WE":

https://addons.mozilla.org/en-US/firefox/addon/save-page-we/

Works well, and much better than a pdf or png because there is no data loss this way.

MozArchive with MAFF or MHT should be fine too but I like Save Page WE because it creates a plain .html file with all images inlined into it.
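
To illustrate what "inlined" buys you: assuming the saved page stores each image as a base64 data: URI in the src attribute of its <img> tag (an assumption about the extension's output, not something confirmed in this thread), the original image bytes can be pulled back out of the .html file. A rough Python sketch; the class and function names are made up for the example.

```python
import base64
from html.parser import HTMLParser


class InlineImageExtractor(HTMLParser):
    """Collect (mime_type, bytes) for every <img> whose src is a base64 data: URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if not src.startswith("data:") or ";base64," not in src:
            return  # an ordinary link, not an inlined image
        header, payload = src.split(",", 1)
        mime = header[len("data:"):].split(";", 1)[0]
        self.images.append((mime, base64.b64decode(payload)))


def extract_inline_images(html_text):
    """Return the decoded inlined images found in a saved page."""
    parser = InlineImageExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.images
```

Each returned pair could be written back to disk with a file extension chosen from the MIME type, which is one way to check that nothing was lost compared to the original attachment.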

 

I've done the automated archive thing too. What I described works, but my version is not currently in a state where releasing it publicly would be of any use to others.

Anonymous ID: 24d34a July 30, 2020, 10:59 p.m. No.21652   >>1655

>>21608 (off-bread)

>using wget for archiving

 

this works. the resulting html page isn't immediately browseable (no styles, links to media are broken) but it does grab all posts and media files:

wget https://8kun.top/comms/res/21322.html --no-clobber --recursive --level=1 --span-hosts --domains=media.8kun.top --wait=0.3 --random-wait

this will not re-download files that already exist, which is what you want for media but not for the thread HTML file. so if you want to re-archive a thread (like when it gets new posts) find the old .html file and delete it or rename it before running that command.
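
The rename-then-rerun step could be scripted. A sketch in Python, assuming wget was run from the current directory and used its default recursive layout of <host>/<path> (so the thread lands in 8kun.top/comms/res/21322.html). The helper names and the timestamp suffix are choices made for this example only.

```python
import subprocess
import time
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(url, root="."):
    """Where the page is expected on disk: <root>/<host>/<path> (assumed layout)."""
    parts = urlparse(url)
    return Path(root) / parts.netloc / parts.path.lstrip("/")


def set_aside_old_copy(url, root="."):
    """Rename an existing thread .html so --no-clobber will fetch a fresh one."""
    page = local_path_for(url, root)
    if not page.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = page.with_name(f"{page.stem}.{stamp}{page.suffix}")
    page.rename(backup)
    return backup


def wget_command(url):
    """Same options as the wget command in the post above."""
    return [
        "wget", url,
        "--no-clobber", "--recursive", "--level=1",
        "--span-hosts", "--domains=media.8kun.top",
        "--wait=0.3", "--random-wait",
    ]


def rearchive(url, root="."):
    """Set the old page aside, then run wget; media files already present are kept."""
    set_aside_old_copy(url, root)
    return subprocess.run(wget_command(url), cwd=root).returncode
```

Keeping the old copy under a timestamped name (instead of deleting it) means posts that were removed between runs are still available in the earlier file.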

 

other stuff that could be improved:

  • this will download both thumbnails and full versions of media files. I think the --accept-regex option is the way to fix this.

  • this might not be able to catch some errors like corrupted media files by itself (sometimes downloads fail partway through and they probably won't be re-downloaded).
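
Both points could also be handled in a clean-up pass after wget finishes. A Python sketch under two assumptions that do not come from this thread: thumbnail URLs and paths contain a /thumb/ segment, and a download that was cut off shows up as a missing end-of-file marker (a JPEG normally ends with the bytes FF D9, a PNG with its IEND chunk). Other file types are not checked. The same /thumb/ pattern might be usable with wget's --reject-regex option, but that is untested here.

```python
import re
from pathlib import Path

THUMB_RE = re.compile(r"/thumb/")  # assumption: thumbnails live under /thumb/

JPEG_END = b"\xff\xd9"          # JPEG end-of-image marker
PNG_END = b"IEND\xaeB`\x82"     # IEND chunk type followed by its fixed CRC


def is_thumbnail(url):
    """True if the URL looks like a thumbnail rather than a full-size file."""
    return THUMB_RE.search(url) is not None


def looks_truncated(path):
    """True/False for .jpg/.jpeg/.png depending on the end marker; None if unchecked."""
    path = Path(path)
    ext = path.suffix.lower()
    data = path.read_bytes()
    if ext in (".jpg", ".jpeg"):
        return not data.endswith(JPEG_END)
    if ext == ".png":
        return not data.endswith(PNG_END)
    return None


def files_to_refetch(root):
    """Files under root that look cut off; delete these before re-running wget."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and looks_truncated(p)
    )
```

This is a heuristic: a valid JPEG with extra bytes appended after the end marker would be flagged as well, so the list is a set of candidates to re-download, not proof of corruption.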

Anonymous ID: 24d34a July 30, 2020, 11:04 p.m. No.21655

>>21652

that line break in the middle of "--level=1" is not supposed to be there.

fixed command here:

https://www2.qanonbin.com/paste/1H1Wur2Fv