dChan - Q Origins Project Archive

Anonymous ID: eb8909 Jan. 18, 2020, 10:46 p.m. No.7852871 🗄️.is 🔗kun >>4333

I have 3 Python services running 24/7, pulling data from the 8chan/8kun JSON APIs. One service stores new thread ids in MongoDB, another downloads all posts from new and updated threads and loads them into MongoDB, and the third downloads all attachments from posts and tracks them in Mongo.
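A minimal sketch of how the second service might decide which threads to download, assuming a vichan-style catalog JSON (a list of pages, each with a `threads` list whose entries carry `no` and `last_modified`). Those field names, the function name `threads_to_fetch`, and the plain dict standing in for the MongoDB lookup are assumptions for illustration, not the poster's actual code.

```python
from typing import Dict, List


def threads_to_fetch(catalog: List[dict], known: Dict[int, int]) -> List[int]:
    """Return thread ids that are new or changed since the last poll.

    catalog: parsed catalog JSON, assumed to be a list of pages, each
             with a "threads" list of {"no": ..., "last_modified": ...}.
    known:   thread id -> last_modified seen on the previous poll
             (a stand-in for a MongoDB lookup).
    """
    stale = []
    for page in catalog:
        for thread in page.get("threads", []):
            thread_id = thread["no"]
            modified = thread.get("last_modified", 0)
            # Fetch if never seen, or if the board reports a newer timestamp.
            if known.get(thread_id, -1) < modified:
                stale.append(thread_id)
                known[thread_id] = modified
    return stale
```

Only the ids returned here would need a full thread download, which keeps each polling pass cheap when most threads have not changed.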

Been running this since 2018 and have over 1TB of content. Runs on linux and mac, not tested on Windows. Can be dockerized to run anywhere but never needed to.

I wanted to parse the notables and auto post to a simple website for normies. Started it but never finished it.

I wish I could run this in the cloud and dump the data in elasticsearch but would probably have a lot of issues with the powers that be.

Anonymous ID: eb8909 Jan. 19, 2020, 9:47 a.m. No.7855083 🗄️.is 🔗kun >>5248

>>7854333

MongoDB is native JSON in/out so all the archived posts are stored in that format. Text can be indexed in Mongo so searching can be sped up.
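A rough sketch of what text indexing and searching could look like with pymongo. The `posts` argument is expected to be a pymongo collection (for example `MongoClient()["archive"]["posts"]`); the database and collection names, the helper names, and the `com` field for the post body are assumptions, since the thread does not say how the documents are laid out.

```python
def ensure_text_index(posts, field="com"):
    """Create a text index on the post body field.

    `field` is an assumed name for the post text; calling this again
    with the same spec does not build a second index.
    """
    return posts.create_index([(field, "text")])


def search_posts(posts, phrase, limit=20):
    """Return up to `limit` posts matching `phrase`, best matches first."""
    cursor = posts.find(
        {"$text": {"$search": phrase}},
        # Ask MongoDB to attach its relevance score to each result.
        {"score": {"$meta": "textScore"}},
    )
    return list(cursor.sort([("score", {"$meta": "textScore"})]).limit(limit))
```

Passing the collection in, rather than connecting inside the functions, keeps the helpers easy to try against a test database first.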

I'm running this on an old laptop with an external drive. Each service is set to sleep for 1 min. after pulling new data.
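The pull-then-sleep cycle could be sketched like this. The function name `poll_forever`, the `max_cycles` and `sleep` parameters (there so the loop can be exercised without waiting), and the catch-all error handling are assumptions, not a description of the actual services.

```python
import time


def poll_forever(pull_once, interval_seconds=60, max_cycles=None, sleep=time.sleep):
    """Call pull_once(), sleep, repeat. Returns the number of completed cycles.

    With max_cycles=None the loop never exits, which is the 24/7 case.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        try:
            pull_once()
        except Exception as exc:
            # Keep a long-running service alive through transient errors
            # (network hiccups, bad JSON) and try again after the sleep.
            print(f"pull failed, retrying after sleep: {exc}")
        cycles += 1
        sleep(interval_seconds)
    return cycles
```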

Were you trying to run elasticsearch in a multi node cluster? It's possible to run as a single node in a Docker container. https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

Another process I wanted to run is image analysis to parse text in memes and store the text in a database to make the memes searchable. That would be fun. AWS has the Amazon Rekognition service and it would probably be great for this but I'd have to upload a ton of content and I'd be a little worried about exposing myself by using that service for this.
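For reference, a hedged sketch of pulling text out of one image with Rekognition's `DetectText` call through boto3. The `rekognition` argument is expected to be `boto3.client("rekognition")`; the helper name `extract_meme_text` and the 80% confidence cutoff are assumptions picked for illustration.

```python
def extract_meme_text(rekognition, image_bytes, min_confidence=80.0):
    """Return the text lines Rekognition finds in an image, in response order."""
    response = rekognition.detect_text(Image={"Bytes": image_bytes})
    return [
        d["DetectedText"]
        for d in response.get("TextDetections", [])
        # "LINE" entries are whole lines; "WORD" entries repeat the same
        # text one word at a time, so they are skipped here.
        if d.get("Type") == "LINE" and d.get("Confidence", 0.0) >= min_confidence
    ]
```

Sending raw bytes this way still uploads every image to AWS, so it does not address the exposure concern; a local OCR tool could fill the same role with a function of the same shape.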

Anonymous ID: eb8909 Jan. 19, 2020, 10:54 a.m. No.7855549 🗄️.is 🔗kun >>5586

>>7855248

Gotcha. I'm running this all locally. If I were running this on a hosted server, the costs would be too much also.

Yea the Mega archives would probably work. I haven't tried this before but I would assume the main factors are contrast of the text against the background, the size of the text, and the clarity/sharpness.

>>7855303

For an initial batch process, I'm sure it would take a lot of time depending on the CPU. After that I'm guessing a lower end system could process images as they were posted to the boards/downloaded here.

Amazon Rekognition has a lot of interesting features. It can parse text in images, objects, faces and celebrities! It works with video also. The advanced features would need training data as it is machine learning.

The main thing I was interested in was loading the text in a database with a link to the image, and also being able to tag the data so I could easily search for a meme to share it. So many times I remember who/what/text is in a meme but I can't find it quick enough.
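A small sketch of that idea using Python's built-in sqlite3 instead of MongoDB, just to keep it self-contained. The table layout and the function names (`open_index`, `add_meme`, `find_memes`) are made up for illustration, and the extracted text is assumed to come from whatever OCR step is used.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the meme index: one row per image, plus a tags table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS memes ("
        "id INTEGER PRIMARY KEY, image_path TEXT UNIQUE, text TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        "meme_id INTEGER REFERENCES memes(id), tag TEXT, UNIQUE(meme_id, tag))"
    )
    return db


def add_meme(db, image_path, text, tags=()):
    """Store the extracted text with a link to the image and optional tags."""
    cur = db.execute(
        "INSERT INTO memes (image_path, text) VALUES (?, ?)", (image_path, text)
    )
    meme_id = cur.lastrowid
    db.executemany(
        "INSERT OR IGNORE INTO tags (meme_id, tag) VALUES (?, ?)",
        [(meme_id, t.lower()) for t in tags],
    )
    db.commit()
    return meme_id


def find_memes(db, words=None, tag=None):
    """Return image paths whose text contains `words` and/or that carry `tag`."""
    sql = (
        "SELECT DISTINCT m.image_path FROM memes m "
        "LEFT JOIN tags t ON t.meme_id = m.id WHERE 1=1"
    )
    params = []
    if words:
        # SQLite's LIKE ignores case for ASCII letters by default.
        sql += " AND m.text LIKE ?"
        params.append(f"%{words}%")
    if tag:
        sql += " AND t.tag = ?"
        params.append(tag.lower())
    sql += " ORDER BY m.image_path"
    return [row[0] for row in db.execute(sql, params)]
```

For example, `find_memes(db, words="keyboard", tag="cat")` would return the paths of every stored image tagged `cat` whose text mentions "keyboard".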

Anonymous ID: eb8909 Jan. 19, 2020, 11:35 a.m. No.7855811 🗄️.is 🔗kun >>5867

>>7855586

100% agree. I have to use AWS for work. As I dove into it for work I quickly realized they are light years ahead of all other cloud services. Immediately thought of a 3 letter agency behind it.