IMPORTANT NOTES ABOUT THE ARCHIVES
Yesterday, I spent much of my day determining which boards the posts on the map came from so that I could get them properly indexed in my database. Many of these posts were damaged by hackers on their original threads. Fortunately, we’ve been archiving threads as we go along, so we have all of the thumbnails.
THE BAD NEWS
When the threads were archived, the archiving site did NOT save the full-size images that go with the graphics. These, apparently, must be saved individually. To get to a full-size image when it is available, you will need to enter the original URL for the image into the archive site’s search box. In a few cases, I was able to retrieve the full-size images that way, but most of them had not been saved.
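For anyone who would rather check many image URLs from a script than paste them into a search box one at a time, here is a minimal sketch. It assumes the archive in question is the Wayback Machine and uses its public availability endpoint; the function names and the example.com address are made up for illustration, and the code only builds the query and reads a reply, it does not send anything.

```python
import json
from urllib.parse import urlencode

# Availability endpoint of the Wayback Machine (assumed to be the archive used).
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(image_url):
    """Build the URL that asks whether one original image URL was archived."""
    return WAYBACK_AVAILABILITY + "?" + urlencode({"url": image_url})


def closest_snapshot(response_text):
    """Return the archived copy's URL from the JSON reply, or None if unsaved."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical image address, used only to show the shape of the query.
print(availability_query("https://example.com/file/abc.png"))
```

An empty "archived_snapshots" object in the reply corresponds to the common case described above, where the full-size image was never saved.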
IMPORTANT PROCEDURE WHEN UPLOADING IMAGES
This is particularly important for those infographics that can’t be read in their thumbnail form:
After you upload that beautifully crafted, highly informative infographic, click the file link above your graphic in the thread, then archive the page that opens, which is dedicated to your graphic alone. This ensures that your graphic is saved. Archiving the thread itself is simply not sufficient to preserve your work.
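The procedure above boils down to "archive the thread, then archive each file link separately." A rough sketch of that checklist follows, assuming the Wayback Machine's "save" address prefix is used for submission; the function names and example URLs are hypothetical, and the code only builds the list of addresses to visit.

```python
# Prefix that asks the Wayback Machine to capture a page (assumed archive).
WAYBACK_SAVE = "https://web.archive.org/save/"


def pages_to_archive(thread_url, file_links):
    """Return the thread URL followed by each distinct file link, in order."""
    ordered = [thread_url]
    for link in file_links:
        if link not in ordered:  # skip a link that was listed twice
            ordered.append(link)
    return ordered


def save_requests(thread_url, file_links):
    """Turn every page that needs archiving into a save address."""
    return [WAYBACK_SAVE + url for url in pages_to_archive(thread_url, file_links)]


# Hypothetical thread and image addresses.
for address in save_requests(
    "https://example.com/board/res/100.html",
    ["https://example.com/file/abc.png", "https://example.com/file/def.png"],
):
    print(address)
```

Listing the thread first and the file links after it mirrors the point made above: the thread capture alone covers only the thumbnails.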
WHICH ARCHIVE TO USE: archive.is v. archive.org (Wayback Machine)
There’s an important difference between these two archives. Archive.org saves pages in a format that is about as close to the original as it can be. These pages retain the original HTML tagging and attributes. Original file names are preserved as well. If I wanted to scrape a chan thread saved on archive.org, I could probably do it successfully. If I saved the archived thread to my own computer, I could copy files from the _files directory of the archive’s version of the thread directly into the _files directory of a thread saved from a chan page, and that would restore the missing files properly.
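The copy step described above could be sketched like this. It is only an illustration: the function name and directory paths are hypothetical, and it relies on the point made above that archive.org keeps the original file names, so a file counts as missing when no file of the same name exists in the chan copy.

```python
import shutil
from pathlib import Path


def restore_missing_files(archive_files, chan_files):
    """Copy files that exist in the archive's _files directory but not in the chan copy.

    Files already present in the chan copy are left untouched.
    Returns the names of the files that were copied.
    """
    archive_files = Path(archive_files)
    chan_files = Path(chan_files)
    chan_files.mkdir(parents=True, exist_ok=True)
    restored = []
    for src in sorted(archive_files.iterdir()):
        dest = chan_files / src.name
        if src.is_file() and not dest.exists():
            shutil.copy2(src, dest)  # copy2 also keeps the file's timestamps
            restored.append(src.name)
    return restored
```

A call might look like `restore_missing_files("archived_thread_files", "chan_thread_files")`, with both directory names standing in for wherever the two saved copies actually live.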
Archive.is does not work the same way. That site converts class attributes into style attributes. From my perspective, this means that the thread cannot be scraped to preserve posts individually, since I depend on class attributes to tell me which part of the post I am parsing. Also, archive.is renames the image files, making it a bit more difficult to use that site for retrieving individual thumbnails to fix broken posts.
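To show why the class attributes matter, here is a small sketch of a class-based scraper built on Python's standard html.parser. It is not the scraper described above: the class name "message", the sample markup, and the style values are all invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "wbr", "meta", "link"}


class PostTextParser(HTMLParser):
    """Collect the text inside every element whose class list includes `target`."""

    def __init__(self, target="message"):  # "message" is a hypothetical class name
        super().__init__()
        self.target = target
        self.depth = 0   # greater than 0 while inside a matching element
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.posts[-1] += data


def extract_posts(html, target="message"):
    """Return the stripped text of each element carrying the target class."""
    parser = PostTextParser(target)
    parser.feed(html)
    parser.close()
    return [post.strip() for post in parser.posts]


# Markup with classes intact, then the same markup with classes replaced by styles.
with_classes = '<div class="post reply"><div class="message">first <b>post</b></div></div>'
with_styles = '<div style="margin:4px"><div style="padding:2px">first <b>post</b></div></div>'
print(extract_posts(with_classes))
print(extract_posts(with_styles))
```

With the classes in place the parser finds the post text; once they are replaced by style attributes it has nothing to match on and returns an empty list, which is the problem described above.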