Anonymous ID: 23d7ce How to archive a website offline April 9, 2018, 5:40 p.m. No.974637 >>1764 >>0990 >>1142 >>6432 >>2172 >>2298 >>4029 >>5562 >>7600 >>0154

This is a guide for archiving websites offline using HTTrack:

https://www.httrack.com/

 

If someone knows of a different method, please feel free to talk about it here. Also, I'm no expert on this…so any tips are welcome.

 

The benefits of copying (aka "mirroring") websites or website pages offline are myriad. For one, you know it won't be deleted unless you delete it. For another, you get a complete copy of the structure of the site, from the directory structure on down. This has actually led me to find folders that I wouldn't otherwise have known existed, as well as other things that the designer might have tucked away.

 

The downside is that it can be slightly complicated. That's why this anon is writing this guide: to help you identify common errors and figure out how to overcome them. By twiddling with certain settings and understanding what they mean, you'll be getting results in no time.

Anonymous ID: 23d7ce April 9, 2018, 5:43 p.m. No.974690 >>2298

Step 1: go to this page and select the version appropriate for your system:

https://www.httrack.com/page/2/en/index.html

 

Go ahead and install it where you want. I can't remember what options pop up, but if it asks you where you would like to archive your stuff, give it an appropriate directory–there will be one folder for each site/page you download, so I recommend you have a completely separate directory/folder so it doesn't mess everything else up.

Anonymous ID: 23d7ce April 9, 2018, 5:45 p.m. No.974732 >>7620

Step 2: Once you've installed it, click "next." The blackened area in the image is where you'll see your directory structure. I've hidden mine so you don't see my 2.9 TB directory of nasty midget porn.

Anonymous ID: 23d7ce April 9, 2018, 5:49 p.m. No.974792

Step 3: Give your project an appropriate name; I recommend naming it after the website that you're going to mirror. After that, put it into an appropriate category–in my example, I've put it under "qresearch_NK", which is where I've put my other North Korea-related mirrors as well.

 

After that, click on "next"

Anonymous ID: 23d7ce April 9, 2018, 5:51 p.m. No.974849

Step 4: Copy & paste the URL from your browser into the indicated box. Before you move forward, the most important step comes up: you have to set the options.

 

These options are rarely "one size fits all." Different websites have different setups, so you've got to adapt your options in order to get what you want. We'll get into that next.

 

After your options are set, click "next"
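
 

Side note for anons who would rather script this than click through the GUI: HTTrack also ships a command-line tool, and the same steps can be driven from Python. A rough sketch, assuming the httrack binary is installed and on your PATH; the URL and output folder below are placeholders, not anything from this guide.

    import subprocess

    url = "https://www.example.com/"          # whatever you would paste into the box
    output_dir = "C:/archives/example_site"   # your separate archive directory

    # -O sets the project/output path, -v shows progress; other options get added later.
    subprocess.run(["httrack", url, "-O", output_dir, "-v"], check=True)

Most of what lives in the "Set options" dialog has some command-line equivalent; httrack --help lists them.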

Anonymous ID: 23d7ce April 9, 2018, 5:53 p.m. No.974883

Step 4a: If you aren't using a proxy, uncheck the "use proxy for ftp transfers" box under the "Proxy" tab.

Anonymous ID: 23d7ce April 9, 2018, 6 p.m. No.975019

Step 4b: Under the "Scan Rules" tab, check the box for each type of media you want to download from the page. Typically you want to get the pictures that go with the site, so check the "gif, jpg, jpeg…" box. If there are movies on the site that you want, check the "mov, mpg, mpeg…" box. If the website has files that you can download, select the "zip, tar, tgz…" box.

 

What you select is really about what you're after: if you want a complete record, select all of them…but if you just need the text, don't check any. Whether or not you select these items can make a huge difference in how big the result is, but don't worry: if it's taking too long or the result is getting too huge, you can always cancel and try again.
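
 

For command-line users, those checkboxes become "+pattern" scan rules tacked onto the end of the command. A minimal sketch; the patterns below are illustrative examples, not the exact lists behind the GUI checkboxes.

    import subprocess

    want_images = True
    want_movies = False
    want_archives = False

    # Build the scan rules from what you actually want to keep.
    rules = []
    if want_images:
        rules += ["+*.gif", "+*.jpg", "+*.jpeg", "+*.png"]
    if want_movies:
        rules += ["+*.mov", "+*.mpg", "+*.mpeg", "+*.avi", "+*.mp4"]
    if want_archives:
        rules += ["+*.zip", "+*.tar", "+*.tgz", "+*.gz"]

    cmd = ["httrack", "https://www.example.com/", "-O", "C:/archives/example_site"] + rules
    subprocess.run(cmd, check=True)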

Anonymous ID: 23d7ce April 9, 2018, 6:03 p.m. No.975096

Step 4c: This setting tells HTTrack whether to follow the website's robots.txt rules as it goes through the site. If you want more information, go here:

https://moz.com/learn/seo/robotstxt

 

I'd say keep it off, but sometimes you run into issues with this setting…so I'm mentioning it here because having the wrong setting often gives an error, and you can try twiddling it between "follow" and "don't follow" to fix the error.
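
 

If you're curious what a site's robots.txt actually says before deciding whether to obey it, Python's standard library can read it for you. This only inspects the rules, it doesn't change anything in HTTrack; example.com and the paths are placeholders.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.example.com/robots.txt")
    rp.read()

    # Check a few paths you care about against the generic "*" user agent.
    for path in ["/", "/images/", "/private/"]:
        allowed = rp.can_fetch("*", "https://www.example.com" + path)
        print(path, "allowed" if allowed else "disallowed")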

Anonymous ID: 23d7ce April 9, 2018, 6:08 p.m. No.975203

Step 4d: Under the "Browser ID" tab, you have the option of setting your "Browser Identity". I'm guessing that, by telling the website which browser you're using, the website will present certain features in order to take advantage of that browser. If you find that you get an error almost immediately after trying to move forward (like pic related), go into your options and change these to "none" and it should clear it up.
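
 

If you suspect the Browser ID is what's tripping the error, you can test how the site responds to a given identity before re-running the whole mirror. A small sketch using the standard library; the URL and user-agent strings are just examples.

    import urllib.request

    url = "https://www.example.com/"
    for ua in ["Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "HTTrack"]:
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            # Print the HTTP status the site returns for each identity.
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(ua, "->", resp.status)
        except Exception as exc:
            print(ua, "->", exc)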

Anonymous ID: 23d7ce April 9, 2018, 6:14 p.m. No.975296

Step 4e: If, after you've mirrored the website, you find that you didn't get what you wanted, you might try messing with these settings. Essentially what they do is tell HTTrack how to move about the website: can it only move downward through the directory structure, or can it go upward as well?

 

Depending on how the website is set up, you may have to mess with these…but I suggest making the settings slightly less restrictive each time, until you get only what you need. The reason I say this is that you may find yourself downloading all manner of things from every website connected to your target–every ad from the ad sites, every movie from links, etc. When I was downloading liddlekidz.org, I found myself well past 2 GB before I realized that I wasn't just getting stuff from that website–I was pulling stuff from at least a dozen websites, and most of it was being downloaded before the stuff from liddlekidz. So be conservative here, otherwise you're wasting your time and hard drive space.
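
 

One way to catch this early: HTTrack normally gives each host it pulls from its own folder inside the project folder, so a quick walk over the directory shows you where the bytes are actually going. A sketch, with a placeholder project path.

    from pathlib import Path

    project = Path("C:/archives/example_site")  # placeholder project folder

    # Sum the downloaded bytes under each host folder in the project.
    totals = {}
    for host_dir in project.iterdir():
        if host_dir.is_dir():
            size = sum(f.stat().st_size for f in host_dir.rglob("*") if f.is_file())
            totals[host_dir.name] = size

    for host, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1_048_576:8.1f} MB  {host}")

If the top of that list is ad networks instead of your target, tighten the settings and run it again.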

Anonymous ID: 23d7ce April 9, 2018, 6:16 p.m. No.975311

Step 5: After you've set your options and clicked "next", you'll get to this page. Just click "next," and hopefully everything goes well.

Anonymous ID: 23d7ce April 9, 2018, 6:28 p.m. No.975521 >>0849

Step 6: After HTTrack has completed, you'll get this page. If there's an error, you'll get a flashing notifier–you can take a look at the log to get the details, and use that information to search the web for a solution. For the most part, twiddling with the settings that I've mentioned will handle any of the errors you get…and it won't take long before you get familiar with them.

 

Sometimes, an error is essentially meaningless. For instance, I often get errors that state that HTTrack couldn't get an image from an ad site because of my settings–that's not important, so I don't worry about it.
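
 

A few lines of Python will pull just the error and warning lines out of the log so you can skim them. This assumes HTTrack's log is the hts-log.txt file in the project folder; check what your install actually writes and adjust the path.

    from pathlib import Path

    log = Path("C:/archives/example_site/hts-log.txt")  # placeholder path

    # Print only the lines that mention an error or warning.
    for line in log.read_text(errors="replace").splitlines():
        if "error" in line.lower() or "warning" in line.lower():
            print(line)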

 

You can click on "browse mirrored site" to see how your copy looks. If you're unhappy, change the options and try again.

 

Finally, you can go into your archive folder, and you'll see a new folder with the project. You can go in there any time, click on "index.html," and it will open up your fresh new copy of the website.
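
 

If you're checking a bunch of mirrors in a row, that same index.html can be opened straight from a script; the path below is a placeholder.

    import webbrowser
    from pathlib import Path

    # Open the top-level index.html of a mirror in the default browser.
    index = Path("C:/archives/example_site/index.html")
    webbrowser.open(index.resolve().as_uri())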

 

Now get out there and archive offline!

 

One final note: if you find something really important, like that image of Hillary sacrificing children to Moloch that we all know is floating around, make sure to archive it first before you come here and tell everyone else. We know that 8ch is being watched, and by blabbing about what you've found you're giving them a chance to pull their stuff offline before anyone else can get to it. But once you've got it, then by all means, tell everyone. The more people there are that have a copy, the better it is for you…after all, you don't want to be the only person with that kind of evidence on your hard drive, do you?

 

Happy archiving!

Anonymous ID: 23d7ce April 10, 2018, 3:04 p.m. No.987840 >>8012

>>985562

I think I have a simple solution. I can't quite verify yet, but so far it seems to be doing what I think you would want it to do.

 

First, you need to go to twitter's advanced search*:

https://twitter.com/search-advanced?lang=en&lang=en

 

After entering the user's name, and the starting date from which you would like to collect tweets, click "search"

 

After that, you'll get a results page. Copy & paste the url into HTTrack. I didn't have to adjust my settings at all–I just chose to download the images, not movies.

 

From then on, it should start downloading without any serious issues. In my first two images, I did a search for @snowden's tweets from October 25th to today. In my trial run (which I'm currently still processing), I chose to grab all of @JulianAssange's tweets since 1-1-2017. Big mistake: as the poor man is locked up, he probably averages about 10-15 tweets a day. So far as I can tell, not only am I gathering his tweets, but also the tweets of those that he has retweeted and the tweets on their profiles. I'm at 9 GB so far, and almost 20k files downloaded.

 

I'm sure there's a setting somewhere that might tell it not to go too far, but I haven't figured that out yet. Regardless, it's pretty much doing what you would want: as you can see from the image, there's a separate folder for each person that Assange has interacted with. Below those folders are some completed .html files, and tons of html.tmp files, which are basically unfinished downloads.

 

When this is all done, I'll go over it and confirm how well it turned out. At this point, I can individually click on some of the .html files and they bring up profiles, so I'm pretty confident.

 

*Twitter's advanced search doesn't work well on mobile devices. If you can't find it on your mobile device and want to reach it, go into your browser's settings and click "request desktop view." Also, it may not be necessary to go to advanced search at all–if you look in the "Capture1.jpg" image I've posted, you can see "from:snowden since:2017-10-25". You may just be able to enter a query like that into their regular search to get the results you want.
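
 

That "from:snowden since:2017-10-25" style query can also be turned into a search URL and handed straight to HTTrack, which skips the advanced-search form entirely. A sketch; the handle, date, and output folder just reuse the examples above, and I'm assuming the regular /search?q= form of the Twitter URL.

    import subprocess
    from urllib.parse import quote

    query = "from:snowden since:2017-10-25"
    url = "https://twitter.com/search?q=" + quote(query)
    print(url)

    # Feed the search-results page to httrack like any other URL.
    subprocess.run(["httrack", url, "-O", "C:/archives/snowden_tweets", "-v"], check=True)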

Anonymous ID: 23d7ce April 10, 2018, 4:06 p.m. No.988648 >>8055 >>8778 >>8857

>>988012

You actually do kind of get a timestamp, in that the files on your computer get a creation date as they are written. So long as you don't go about editing them, the date remains intact. You might want to consider making a copy and storing it someplace safe; if it were for a legal case, I would perhaps throw it onto a thumb drive and give that to, say, a lawyer or notary public. You could upload it to some website, but if all of this archiving is about having information while the web is down, then that presents a problem…
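
 

If you want something a bit stronger than the file dates, you can record each file's modification time plus a SHA-256 hash right after the mirror finishes; that gives you (or the lawyer) a way to show later that nothing has been altered. A minimal sketch, with a placeholder project path.

    import hashlib
    from datetime import datetime, timezone
    from pathlib import Path

    project = Path("C:/archives/example_site")

    # Write one line per file: hash, modification time (UTC), relative path.
    with open(project / "manifest.txt", "w", encoding="utf-8") as out:
        for f in sorted(project.rglob("*")):
            if f.is_file() and f.name != "manifest.txt":
                digest = hashlib.sha256(f.read_bytes()).hexdigest()
                mtime = datetime.fromtimestamp(f.stat().st_mtime, tz=timezone.utc).isoformat()
                out.write(f"{digest}  {mtime}  {f.relative_to(project)}\n")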

 

I've found something interesting about archiving with HTTrack, but it doesn't solve the Twitter problem. You can set the "mirror depth" to a certain number, which represents the number of "clicks" away from your page you want to copy. If you don't set this (in Options > Limits > Maximum mirror depth), you may wind up with a gigantic download.

 

Consider this scenario: You're downloading the last 100 tweets from someone. In one of those tweets, they re-tweeted someone else…so that person is clicked on, which brings up all of their tweets. Each of those is clicked on…and on and on and on…

 

So this is what I recommend: for social media, start with a setting of 1 or 2; if it doesn't get enough, bump it up one until you get what you need. Leaving it unset means that it will continue onwards with infinite clicks–when it comes to social media, that means that it could go on for a -very- long time, as people quote other people, etc.
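
 

To put rough numbers on why leaving the depth unset runs away: if every page links out to even a handful of others, the number of pages a crawl can reach grows roughly exponentially with the depth. A quick back-of-the-envelope.

    # Upper-bound page counts for a crawl with B links per page and depth D.
    for branching in (5, 20, 50):
        for depth in (1, 2, 3, 4):
            print(f"~{branching} links/page, depth {depth}: up to ~{branching ** depth:,} pages")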

 

As far as Twitter is concerned, I've tried it with a setting of 1, 2, and 3; 1 and 2 got me his profile, 3 got me his profile in a ton of different languages and maybe a month's worth of tweets (with Chinese headings). So it doesn't look like it's necessarily a productive means of getting the info you want–more than likely it's going to be a matter of using the api to get the results you want.

 

I found the issue I've been having–it has to do with another setting.

Anonymous ID: 23d7ce April 11, 2018, 7:24 p.m. No.1006231 >>6290

>>988857

I think I misspoke when I said "headings." What I meant was that the entire page is in Chinese, except for the tweet itself.

 

What's happening is that I'm getting copies of certain .html files in every language–1000 in total for the index, login, and search .html files.

It looks like they map each language to a 4-digit hexadecimal code, which is appended to the filename like so: indexffbb.html, etc. Also, the other files have different designations.

 

It would be trivial to write a program that found the right one, but you'd need to know the right one beforehand in order to set up the right filter, so the whole purpose is defeated. Who knows how often they change it? That having been said, it really doesn't matter; the files are relatively small.

 

I think a setting of "3" is best. I pushed it to "4," and ended up getting far more than I wanted. Once your download completes, look for the folder of the person whose tweets you're collecting, and they should all be in there. It will be in "project name folder" > twitter.com > "person whose tweets you're getting" > status. For me they're in English.

 

The difference between a setting of "3" and "4", in my case, was about a 50x increase in my download size. Twitter put me on a time-out because of it, lulz.

Anonymous ID: 23d7ce April 11, 2018, 7:26 p.m. No.1006290

>>1006231

Oh, and in case it helps you wget users, the file structure is as I mentioned: twitter.com > (twitter handle, without the '@' in front) > status > *.html
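
 

In Python terms, pulling the list of archived tweet pages out of a finished mirror looks something like this; the project path and handle are placeholders.

    from pathlib import Path

    project = Path("C:/archives/assange_tweets")
    handle = "JulianAssange"

    # The mirror lays tweets out as twitter.com/<handle>/status/*.html
    status_dir = project / "twitter.com" / handle / "status"
    for page in sorted(status_dir.glob("*.html")):
        print(page.name)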