AUTOMATED PDF DIGGING
This is a method I used to dig through Hillary's emails that you may find useful.
-Requires Linux type Operating System
-Ensure you have pdftotext installed
-Put list of pdf URLs into a file, one URL on each line.
-Save as allpdfs.txt
-Run following command from terminal:
for i in $(grep -E '^htt.*\.pdf$' allpdfs.txt)
;do foo=$(basename $i);wget $i; pdftotext $foo; rm $foo; done
It fetches each pdf listed in allpdfs.txt by its URL, converts it to a text file, and saves it under the pdf's name but with the extension txt.
For example, if a line in the file "allpdfs.txt" contains:
http://some.website.com/a/b/c/d/grab_this.pdf
the output would be
grab_this.txt
which is a text conversion of the pdf document.
The original pdf is deleted to save space (remove the "rm $foo;" part if you want to keep the original pdf too).
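Before running the full loop, you can check what the grep filter will actually keep: it only accepts lines that start with "htt" (so http or https) and end in ".pdf". A quick sketch, using a throwaway file name (sample.txt here stands in for your allpdfs.txt):

```shell
# Throwaway sample list: two good PDF URLs, one bad line, one non-pdf URL
cat > sample.txt <<'EOF'
http://example.com/docs/report.pdf
not-a-url.pdf
https://example.com/files/memo.pdf
http://example.com/page.html
EOF

# Same filter as in the loop: only the two PDF URLs should survive
grep -E '^htt.*\.pdf$' sample.txt
```

Lines with trailing whitespace or query strings (e.g. ...file.pdf?id=1) will be skipped by this pattern, so clean up your URL list first if needed.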
CAVEATS: Obviously pictures are ignored. Conversions may introduce spelling mistakes, especially for scanned pdfs, depending on document clarity.
DIGGING
Now that you have a text version of your pdfs, you can simply search them all with grep:
#basic search:
grep keyword *.txt
#search for a phrase:
grep 'some phrase' *.txt
#case insensitive search (will find keyword, Keyword, kEYWORD, etc):
grep -i keyword *.txt
#consult the grep manual for other options:
man grep
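When digging through many converted files, grep's -l option (list only the names of matching files) is handy for narrowing down which documents to read. A sketch using two throwaway text files (doc1.txt and doc2.txt are illustrations, not real output):

```shell
# Two throwaway files standing in for pdftotext output
printf 'Meeting notes about the pipeline deal.\n' > doc1.txt
printf 'Nothing of interest here.\n' > doc2.txt

# -i ignores case, -l prints only the filenames that contain a match
grep -il 'pipeline' doc1.txt doc2.txt
```

Add -n instead of -l to see matching line numbers inside each file.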