Anonymous ID: d3165d Feb. 13, 2019, 11:20 a.m. No.5158120   >>8147 >>8216 >>8315 >>8437

BAKER

 

Thanks for adding this to the notables (lb).

 

>>5157325 (lb) - A Call to Codefags/Shovels

 

I didn't think anyone noticed… so I said fuck it and did the first round of processing myself.

 

Guess I'm outing myself as a closeted codefag, kek.

 

 

U.S. House of Representatives, Financial Disclosures '08-'17

 

https://pastebin.com/BqXv6avq

 

CSV format; direct links to the .PDFs are in the "Sauce" column.
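
If anyone wants to pull the paste straight into Python instead of eyeballing it, something like this should do it (rough sketch; assumes the paste carries a header row and uses pastebin's /raw/ endpoint):

# Pull the flat CSV straight from the paste and collect the PDF links.
# Assumes a header row with the "Sauce" column described above.
import csv
import io

import requests

raw = requests.get("https://pastebin.com/raw/BqXv6avq", timeout=30).text
rows = list(csv.DictReader(io.StringIO(raw)))
pdf_links = [row["Sauce"] for row in rows]
print(len(rows), "rows,", len(pdf_links), "PDF links")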

 

 

So… who out there wants to do some PDF parsing? Any takers?

Anonymous ID: d3165d Feb. 13, 2019, 11:34 a.m. No.5158273   >>8323 >>8437

>>5158216

 

Greetings fellow codefag.

 

Looping over all years '08-'17 now as we speak, yay for macros. A shitty/slow approach, but it was stupid fast and easy to process the dataset this way.

 

Should have the full, correct output soon.

 

Could definitely use help with parsing the individual PDFs to text, JSON, etc., you know, something structured. Can you help with that?

Anonymous ID: d3165d Feb. 13, 2019, 11:45 a.m. No.5158437   >>8530

>>5158120

>>5158147

>>5158215

>>5158273

 

>>5158315

 

Thanks for the report on the DWS 404, anon.

 

Might be a processing issue on my part or perhaps truly a 404.

 

If there's a codefag out there who could run HTTP HEAD requests against the batch, that would help determine where the issue lies.
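
Here's roughly what I mean, a minimal sketch that HEAD-checks every link in the "Sauce" column and prints anything that doesn't come back 200 (the local filename is just a placeholder):

# HEAD-check every PDF link in the dataset and report the problem children.
# "house_disclosures.csv" is a placeholder name for the flat CSV posted above.
import csv

import requests

with open("house_disclosures.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row["Sauce"]
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException as exc:
            status = exc.__class__.__name__
        if status != 200:
            print(status, url)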

 

>>5158303

 

If we can follow the PDFs and parse them all into something structured, we can get the whole mess into an indexed and searchable form.

 

We could also group each rep together with all of their respective PDFs.

 

Lastly, it gives other anons an easy way to mass download all the PDFs, or just a subset, since the CSV is grep-able.
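
Something like this would cover the per-rep grouping (rough sketch; the "Last"/"First" header names are my guess, check the actual CSV):

# Group rows per rep so each rep's filings and PDF links sit together.
# Column names "Last" / "First" are assumed -- adjust to the real headers.
import csv
from collections import defaultdict

groups = defaultdict(list)
with open("house_disclosures.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        groups[(row["Last"], row["First"])].append(row["Sauce"])

for (last, first), links in sorted(groups.items()):
    print(f"{last}, {first}: {len(links)} filings")

And for the mass-download case, no code needed at all: grep the flat CSV for a name, pull out the Sauce URLs, and feed the list to wget.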

Anonymous ID: d3165d Feb. 13, 2019, 12:03 p.m. No.5158645

>>5158530

 

Very close. Here's the first-phase proposal:

 

  1. Grab all .ZIPs for 2008-2017

 

  2. Collate the .txt or .xml index data into a flat .csv, adding a "Sauce" column linking directly to each PDF

 

Essentially, http://clerk.house.gov/public_disc/financial-pdfs/<Year_col6>/<DocID_Col8>.pdf

 

Note: must be http; these aren't being served over https.

 

  3. Loop over the PDF links, verifying existence with a HEAD request returning 200 (rough sketch after this list).

 

  4. Post the verified dataset for anons to devour
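
Something like this is what I have in mind for steps 1-3 (rough sketch only: the <year>FD.zip filename pattern, the tab delimiter, and the header-skip heuristic are assumptions from memory, so adjust against the real archives):

# Rough sketch of phase one: grab each yearly ZIP, read the index file,
# build the direct PDF link ("Sauce"), HEAD-verify it, and write the flat CSV.
import csv
import io
import zipfile

import requests

BASE = "http://clerk.house.gov/public_disc/financial-pdfs"  # http only, per the note above

def index_rows(year):
    resp = requests.get(f"{BASE}/{year}FD.zip", timeout=60)  # ZIP name pattern is a guess
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        # Only handling the .txt index here for brevity; the .xml variant works the same way.
        name = next(n for n in zf.namelist() if n.lower().endswith(".txt"))
        text = zf.read(name).decode("utf-8", errors="replace")
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):  # assuming tab-delimited
        if len(fields) < 8 or not fields[7].strip().isdigit():    # skip header/junk rows
            continue
        yr, doc_id = fields[5].strip(), fields[7].strip()         # Year col 6, DocID col 8
        yield fields + [f"{BASE}/{yr}/{doc_id}.pdf"]              # the "Sauce" column

def head_ok(url):
    return requests.head(url, allow_redirects=True, timeout=10).status_code == 200

with open("house_disclosures.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for year in range(2008, 2018):
        writer.writerows(row for row in index_rows(year) if head_ok(row[-1]))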

 

 

Second phase would involve actually pulling each individual PDF down and parsing it from pdf -> csv|json|xml (i.e. ideally something structured, or semi-structured if we must).
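
One way to tackle that second phase (not set on it, just a sketch): dump each PDF to raw text with pdfminer.six, or the pdftotext CLI, and drop it into a SQLite FTS table so the whole set becomes full-text searchable. File and table names below are placeholders, and scanned filings would still need OCR.

# Sketch of phase two: extract raw text from each downloaded PDF and index it
# for full-text search. Needs pdfminer.six and a sqlite build with FTS5.
import sqlite3
from pathlib import Path

from pdfminer.high_level import extract_text  # pip install pdfminer.six

con = sqlite3.connect("disclosures.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(doc_id, body)")

for pdf in Path("pdfs").glob("*.pdf"):
    body = extract_text(str(pdf))  # raw text; image-only scans come back empty
    con.execute("INSERT INTO docs VALUES (?, ?)", (pdf.stem, body))
con.commit()

# Deep search: every disclosure matching a term.
term = "trust"  # whatever anons are digging on
for (doc_id,) in con.execute("SELECT doc_id FROM docs WHERE docs MATCH ?", (term,)):
    print(doc_id)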

 

That would be the first step toward giving anons the ability to deep-search this whole mess of fugly PDFs.

 

Open to better ideas; this is just kinda something I'm running with atm.