Anonymous ID: 58512e Jan. 29, 2018, 6:51 a.m. No.199963   >>0033

AUTOMATED PDF DIGGING

This is a method I used to dig through Hillary's emails; you may find it useful.

-Requires a Linux-type operating system

-Ensure you have pdftotext installed (see the install note after the command below if you need it)

-Put a list of PDF URLs into a file, one URL per line

-Save it as allpdfs.txt (example shown after the command below)

-Run the following command from a terminal:

for i in $(grep -E '^htt.*\.pdf$' allpdfs.txt); do foo=$(basename "$i"); wget "$i"; pdftotext "$foo"; rm "$foo"; done
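
If pdftotext is missing: on Debian/Ubuntu-type systems it usually comes with the poppler-utils package; other distros may package it differently, so check your package manager:

sudo apt-get install poppler-utils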
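
And allpdfs.txt is nothing special, just one URL per line, e.g. (made-up URLs):

http://example.com/docs/first_document.pdf
http://example.com/docs/second_document.pdf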

It will download each PDF from its URL in allpdfs.txt, convert it to a text file, and save it under the same name as the PDF but with a .txt extension.

For example, if a line in the file "allpdfs.txt" contains:

http://some.website.com/a/b/c/d/grab_this.pdf

the output would be

grab_this.txt

which is a text conversion of the pdf document.

The original PDF is deleted to save space (drop the rm "$foo"; part if you want to keep the original PDFs too, as shown below).
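
For clarity, the keep-the-PDFs version of the same loop is just:

for i in $(grep -E '^htt.*\.pdf$' allpdfs.txt); do foo=$(basename "$i"); wget "$i"; pdftotext "$foo"; done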

CAVEATS: Obviously pictures are ignored. The conversion may contain spelling mistakes, especially for scanned PDFs, depending on document clarity.

 

DIGGING

Now that you have text versions of your PDFs, you can search them all with grep:

#basic search:
grep keyword *txt

#search for a phrase:
grep 'some phrase' *txt

#case insensitive search (will find keyword, Keyword, kEYWORD, etc):
grep -i keyword *txt

#consult the grep manual for other options:
man grep
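
#one more flag worth knowing (the next post relies on it): -l prints only the names of the files that contain a match, not the matching lines:
grep -l keyword *txt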

Anonymous ID: 58512e Jan. 29, 2018, 7:06 a.m. No.200033

>>199963

Finding the source pdf URL from a keyword match in the text files:

#search the text files for "keyword" to get the names of the matching text files,
#strip the .txt extension from each name, and use that to search through "allpdfs.txt",
#which holds the URLs. Append each URL found to a file called URLs_of_pdfs_matching_keyword.txt:
for i in $(grep -l keyword *.txt); do j=$(basename "$i" .txt); grep "$j" allpdfs.txt >> URLs_of_pdfs_matching_keyword.txt; done

#display the contents of the text file:
cat URLs_of_pdfs_matching_keyword.txt
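
Not part of the original dig, but if you do this a lot, a small shell function saves editing the keyword in two places (the name findpdfs is made up; it just wraps the loop above with the keyword passed as an argument, case-insensitive):

findpdfs() {
    # usage: findpdfs keyword
    # find text conversions containing the keyword (case-insensitive),
    # then look up the matching source PDF URLs in allpdfs.txt
    kw="$1"
    for i in $(grep -li "$kw" *.txt); do
        j=$(basename "$i" .txt)
        grep "$j" allpdfs.txt
    done > "URLs_of_pdfs_matching_${kw}.txt"   # > rather than >> so each run starts fresh
    cat "URLs_of_pdfs_matching_${kw}.txt"
}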