Henry Case ID: ae3e60 Archiving the Web April 9, 2018, 9:50 p.m. No.978477   >>8833 >>2212 >>2298 >>3930 >>4029 >>5562 >>7352 >>7600

There are a LOT of options for archival, as listed in graphic (1) and article (2). There are synopses for Wget in (3) and (4).

 

I prefer Wget, for the simple reason of power and flexibility. Those of you who use *nix prolly already know this, but the mirroring tool of choice is Wget, and it works really well if you have access to a VPS: you can queue the job up, then tgz and sftp the result when it's complete. Sometimes it can take days to mirror a full site if they've got aggressive leech protection.
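A rough sketch of that VPS workflow (the host name, file names, and example.com are placeholders, adjust for your own setup):

nohup wget --mirror --page-requisites --adjust-extension --convert-links https://example.com/ > mirror.log 2>&1 &
# when the mirror finishes, bundle it up and move it off the VPS
tar -czf example-mirror.tgz example.com/
sftp anon@homebox    # then: put example-mirror.tgz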

 

You'll want to be aware of the robots option and the retry/wait options if you notice a server blocking your access because of too many requests in rapid succession, or because of a bitchy robots.txt.

 

MAN:: Wget - The non-interactive network downloader.

SYNOPSIS
    wget [option]... [URL]...

OPTIONS
    Download Options

    -w seconds
    --wait=seconds
        Wait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of in seconds, the time can be specified in minutes using the "m" suffix, in hours using "h" suffix, or in days using "d" suffix.

        Specifying a large value for this option is useful if the network or the destination host is down, so that Wget can wait long enough to reasonably expect the network error to be fixed before the retry. The waiting interval specified by this function is influenced by "--random-wait", which see.
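For example, a politer pull against a server that has been throttling you might look something like this (example.com is a placeholder; tune the numbers to the server):

wget --wait=2 --random-wait --tries=5 --mirror -e robots=off https://example.com/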

 

My recommended initial configuration is below, but I'm sure you can tailor it to suit your needs. (Note that --mirror turns on timestamping, which wget won't combine with --no-clobber, so don't mix the two.)

wget --mirror --page-requisites --adjust-extension --no-parent --no-check-certificate --convert-links -e robots=off https://example.com/

 

Happy archiving.

Henry Case ID: ae3e60 Twitter WebScraping with Python April 10, 2018, 2:50 p.m. No.987606   >>7830 >>7855

Twitter scraping can be done with something like Python. Whether the output is usable in a legal proceeding will vary by jurisdiction, as scraped data is a prime target for claims of distortion.

 

If you intend to use the data in a deposition, it's best to scrape it and also print it (with a timestamp) to PDF and/or hardcopy. When you want to admit it, you'll need it to be certified by the court, so the more supporting information, the better.
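A rough sketch of that belt-and-suspenders approach (the URL and file names here are hypothetical, and whether a timestamp-plus-hash record satisfies a particular court is a question for your lawyer):

# save a copy of the page, then record when it was pulled and what it hashed to
wget --output-document=tweet-capture.html "https://twitter.com/SomeAccount/status/123456789"
date -u +"%Y-%m-%dT%H:%M:%SZ" > tweet-capture.meta
sha256sum tweet-capture.html >> tweet-capture.meta
# wget only gets the raw HTML served without JavaScript, which is one more reason to keep the PDF/hardcopy too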

 

https://medium.com/@dawran6/twitter-scraper-tutorial-with-python-requests-beautifulsoup-and-selenium-part-1-8e76d62ffd68

Henry Case ID: ae3e60 No Quick and Easy eDiscovery April 10, 2018, 3:22 p.m. No.988086

There is NO quick and easy way to do data collection that will stand up in court. The cause is worth it, so put some sweat and tears into this shit.

http://technology.findlaw.com/electronic-discovery.html

Henry Case ID: ae3e60 Complications April 11, 2018, 6:03 p.m. No.1004570   >>1225

>>995761

You're overcomplicating the issue… in most cases there is no need to encode the URL, so if you configure a shortcut for the "Open URL" service on most systems, it's a no-brainer.
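Same idea from the command line, for illustration (hypothetical URL; open is the macOS launcher, xdg-open the usual Linux equivalent); the URL is passed through as-is, no percent-encoding needed:

open "https://example.com/research/res/978477.html#q978477"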

Henry Case ID: ae3e60 Casting a wider net... April 11, 2018, 10:38 p.m. No.1008724

>>988857

 

By default, wget grabs everything, so it's actually better to cast a wide net when mirroring unknown servers, as you said… you can find more hidden gems that way. If you add the accept/reject flags, you'll limit your retrieval to just those filetypes.
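If you do want to narrow a pull down, something like this (hypothetical host and extension list) limits retrieval to just the listed filetypes:

wget --mirror --no-parent --accept=pdf,jpg,png https://example.com/docs/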