>>14193
Hi,
>how are you currently identifying general breads?
the first part is a simple regex against the thread title:
/Q\sResearch\sGeneral\s#\s[\d,]+/i
this can and will fail. the most common reason is a baker making a mistake in the bread number or the title
so once every few days we look at what the scraper has caught and clean everything up manually
the notables parsing is also somewhat self-correcting since notables for bread #1000 usually stay around in the dough until #1003 or so. that gives bakers a chance to fix mistakes in the next bread and then have the notables from previous breads picked up.
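a rough Python sketch of that title check, in case it helps. the pattern is the one quoted above with a capture group added around the number; `bread_number` is just an illustrative name, not something from our scraper. anything that comes back as None would go into the pile for manual cleanup.

```python
import re

# The title pattern from above, with a capture group added around the number.
GENERAL_TITLE = re.compile(r"Q\sResearch\sGeneral\s#\s([\d,]+)", re.IGNORECASE)

def bread_number(title):
    """Return the bread number from a thread title, or None if it doesn't match.

    Hypothetical helper for illustration only.
    """
    m = GENERAL_TITLE.search(title)
    if m is None:
        return None
    # Strip thousands separators such as "1,234".
    digits = m.group(1).replace(",", "")
    if not digits:
        # The character class also matches a run of commas with no digits.
        return None
    return int(digits)
```

a title with a typo simply returns None instead of raising, so a scraper can log it and keep going.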
>Baker is identified by the very first post of a bread. I will also go backwards from 1 general to the previous to figure out baker changes and mark posts accordingly.
I wouldn't recommend this. there is a loose process for baker handoffs, but it's for humans, not computers, so it's not precise enough to be parsed. there's also no way to link users together across different threads since IDs change with every thread. all kinds of other things can happen too, like 3 different bakers/notable collectors working on the same long bread during nightshift.
>The idea also was to archive full size pics + videos when a baker UID touched the posts. Baker is identified by the very first post of a bread.
probably the best you will be able to do using this approach is to make it work for about 60-75% of breads, again due to handoffs.
the approach wearethene.ws takes instead is to validate each new bread and then archive images from the posts that were quoted in the notables section. this probably works for about 95% of breads.
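a minimal sketch of the "archive what the notables quote" idea, not our actual code. it assumes notables reference posts with the usual `>>12345` links and that you already have a mapping from post number to that post's image URLs. `images_to_archive` and `images_by_post` are made-up names for the example.

```python
import re

# Assumption: notables link to posts as ">>12345".
POST_REF = re.compile(r">>(\d+)")

def images_to_archive(notables_text, images_by_post):
    """Collect image URLs from posts quoted in the notables text.

    images_by_post is a hypothetical dict: post number -> list of image URLs.
    Quoted posts that are missing from the dict are skipped.
    """
    wanted = []
    seen = set()
    for ref in POST_REF.findall(notables_text):
        post_no = int(ref)
        if post_no in seen:
            continue  # the same post can be quoted more than once
        seen.add(post_no)
        wanted.extend(images_by_post.get(post_no, []))
    return wanted
```

the point of this approach is that it never needs to know who the baker is, so handoffs don't matter.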
our site doesn't archive videos, PDFs, or animated gifs due to space requirements so that would definitely be valuable.
there aren't really any tricks or standards in general, just "whatever works". wearethene.ws only scrapes the HTML. I seem to remember some issues with the JSON threads list in particular being out of sync with the rest of the board, but I don't have any definite info on that.
if there are any API endpoints wearethene.ws can provide that would make your projects easier, let us know and we'll see if it's feasible. you're also welcome to scrape the HTML of course.