Anonymous ID: d3165d Feb. 13, 2019, 11:20 a.m. No.5158120   >>8147 >>8216 >>8315 >>8437

BAKER

 

Thanks for adding this to the notables (lb).

 

>>5157325 (lb) - A Call to Codefags/Shovels

 

I didn't think anyone noticed… so I said fuck it and did the first round of processing myself.

 

Guess I'm outing myself as a closeted codefag, kek.

 

 

U.S. House of Representatives, Financial Disclosures '08-'17

 

https://pastebin.com/BqXv6avq

 

CSV format; direct links to the .PDFs are in the "Sauce" column.
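
If anyone wants to pull the paste straight into Python instead of eyeballing it, something like this should do it (rough sketch; assumes the paste carries a header row and uses pastebin's /raw/ endpoint):

# Pull the flat CSV straight from the paste and collect the PDF links.
# Assumes a header row with the "Sauce" column described above.
import csv
import io

import requests

raw = requests.get("https://pastebin.com/raw/BqXv6avq", timeout=30).text
rows = list(csv.DictReader(io.StringIO(raw)))
pdf_links = [row["Sauce"] for row in rows]
print(len(rows), "rows,", len(pdf_links), "PDF links")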

 

 

So… who out there wants to do some PDF parsing? Any takers?

Anonymous ID: d3165d Feb. 13, 2019, 11:34 a.m. No.5158273   >>8323 >>8437

>>5158216

 

Greetings fellow codefag.

 

Looping over all years '08-'17 now as we speak, yay for macros. A shitty/slow approach, but it was stupid fast and easy to process the dataset this way.

 

Should have the full, correct output soon.

 

Could definitely use help with parsing the individual PDFs to text, JSON, etc., you know, something structured. Can you help with that?

Anonymous ID: d3165d Feb. 13, 2019, 11:45 a.m. No.5158437   >>8530

>>5158120

>>5158147

>>5158215

>>5158273

 

>>5158315

 

Thanks for the report on the DWS 404, anon.

 

Might be a processing issue on my part or perhaps truly a 404.

 

If there's a codefag out there who could run HTTP HEAD requests against the batch, that would help determine where the issue lies.
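
Here's roughly what I mean, a minimal sketch that HEAD-checks every link in the "Sauce" column and prints anything that doesn't come back 200 (the local filename is just a placeholder):

# HEAD-check every PDF link in the dataset and report the problem children.
# "house_disclosures.csv" is a placeholder name for the flat CSV posted above.
import csv

import requests

with open("house_disclosures.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row["Sauce"]
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException as exc:
            status = exc.__class__.__name__
        if status != 200:
            print(status, url)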

 

>>5158303

 

If we can follow the PDFs and parse them all into something structured, we can get the whole mess into an indexed and searchable form.

 

We could also group each rep together with all of their respective PDFs.

 

Lastly, it gives other anons an easy way to mass download all the PDFs, or just a subset, since the CSV is grep-able.
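
Something like this would cover the per-rep grouping (rough sketch; the "Last"/"First" header names are my guess, check the actual CSV):

# Group rows per rep so each rep's filings and PDF links sit together.
# Column names "Last" / "First" are assumed -- adjust to the real headers.
import csv
from collections import defaultdict

groups = defaultdict(list)
with open("house_disclosures.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        groups[(row["Last"], row["First"])].append(row["Sauce"])

for (last, first), links in sorted(groups.items()):
    print(f"{last}, {first}: {len(links)} filings")

And for the mass-download case, no code needed at all: grep the flat CSV for a name, pull out the Sauce URLs, and feed the list to wget.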

Anonymous ID: d3165d Feb. 13, 2019, 12:03 p.m. No.5158645

>>5158530

 

Very close. Here's the first-phase proposal:

 

  1. Grab all .ZIPs for 2008-2017

 

  2. Collate the .txt or .xml index data into a flat .csv, adding a "Sauce" column linking directly to each PDF

 

Essentially, http://clerk.house.gov/public_disc/financial-pdfs/<Year_col6>/<DocID_Col8>.pdf

 

Note: must be http; these aren't being served over https.

 

  3. Loop over the PDF links, verifying existence with a HEAD request returning 200 (rough sketch after this list).

 

  4. Post the verified dataset for anons to devour
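
Something like this is what I have in mind for steps 1-3 (rough sketch only: the <year>FD.zip filename pattern, the tab delimiter, and the header-skip heuristic are assumptions from memory, so adjust against the real archives):

# Rough sketch of phase one: grab each yearly ZIP, read the index file,
# build the direct PDF link ("Sauce"), HEAD-verify it, and write the flat CSV.
import csv
import io
import zipfile

import requests

BASE = "http://clerk.house.gov/public_disc/financial-pdfs"  # http only, per the note above

def index_rows(year):
    resp = requests.get(f"{BASE}/{year}FD.zip", timeout=60)  # ZIP name pattern is a guess
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        # Only handling the .txt index here for brevity; the .xml variant works the same way.
        name = next(n for n in zf.namelist() if n.lower().endswith(".txt"))
        text = zf.read(name).decode("utf-8", errors="replace")
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):  # assuming tab-delimited
        if len(fields) < 8 or not fields[7].strip().isdigit():    # skip header/junk rows
            continue
        yr, doc_id = fields[5].strip(), fields[7].strip()         # Year col 6, DocID col 8
        yield fields + [f"{BASE}/{yr}/{doc_id}.pdf"]              # the "Sauce" column

def head_ok(url):
    return requests.head(url, allow_redirects=True, timeout=10).status_code == 200

with open("house_disclosures.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for year in range(2008, 2018):
        writer.writerows(row for row in index_rows(year) if head_ok(row[-1]))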

 

 

Second phase would involve actually pulling each individual PDF down and parsing it from pdf -> csv|json|xml (i.e. ideally something structured, or semi-structured if we must).
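
One way to tackle that second phase (not set on it, just a sketch): dump each PDF to raw text with pdfminer.six, or the pdftotext CLI, and drop it into a SQLite FTS table so the whole set becomes full-text searchable. File and table names below are placeholders, and scanned filings would still need OCR.

# Sketch of phase two: extract raw text from each downloaded PDF and index it
# for full-text search. Needs pdfminer.six and a sqlite build with FTS5.
import sqlite3
from pathlib import Path

from pdfminer.high_level import extract_text  # pip install pdfminer.six

con = sqlite3.connect("disclosures.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(doc_id, body)")

for pdf in Path("pdfs").glob("*.pdf"):
    body = extract_text(str(pdf))  # raw text; image-only scans come back empty
    con.execute("INSERT INTO docs VALUES (?, ?)", (pdf.stem, body))
con.commit()

# Deep search: every disclosure matching a term.
term = "trust"  # whatever anons are digging on
for (doc_id,) in con.execute("SELECT doc_id FROM docs WHERE docs MATCH ?", (term,)):
    print(doc_id)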

 

That would be the first step toward giving anons the ability to deep-search this whole mess of fugly PDFs.

 

Open to better ideas; this is just kinda something I'm running with atm.