Anonymous ID: 2b1f02 May 18, 2020, 4:30 a.m. No.14193   >>4203

Hello?

 

Is the maintainer of wearethene.ws on here?

I have a little question on the technical side.

 

I'm currently implementing automatic archiving of the generals via archive.org.
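
The archiving call itself is simple enough. A minimal sketch in Python of the kind of request involved (the thread URL and User-Agent string are just examples):

```python
import urllib.request

def archive_on_wayback(url):
    # The Wayback Machine snapshots whatever URL is appended to /save/.
    req = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "auto-archive-anon/0.1"},  # example UA
    )
    urllib.request.urlopen(req, timeout=120)

# archive_on_wayback("https://8kun.top/qresearch/res/14193.html")
```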

 

The idea was also to archive full-size pics + videos whenever a baker UID touched a post.

 

The baker is identified by the very first post of a bread. I will also walk backwards from one general to the previous one to figure out baker changes and mark posts accordingly.
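
In rough Python terms, the detection boils down to this (a sketch; I'm assuming the vichan-style JSON where each post carries an "id" field on boards with poster IDs):

```python
def baker_id(thread_json):
    # The baker is whoever made the very first post of the bread.
    return thread_json["posts"][0]["id"]

def baker_posts(thread_json):
    # All posts in this bread made under the baker's ID.
    bid = baker_id(thread_json)
    return [p for p in thread_json["posts"] if p.get("id") == bid]
```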

 

And now to the technical question:

How are you currently identifying general breads?

I guess it must be some kind of heuristic, like matching some fixed text or a regular expression.
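
Something like this is what I would try myself (a sketch; the exact title pattern is my guess, not anything confirmed):

```python
import re

# Guessed heuristic: general breads carry a numbered subject like
# "Q Research General #8816: Some Edition".
GENERAL_RE = re.compile(r"Q\s*Research\s*General\s*#(\d+)", re.IGNORECASE)

def is_general(subject):
    return bool(subject) and GENERAL_RE.search(subject) is not None
```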

 

Can you please tell me what you do?

I would like to stay compatible with everything else.

 

If you've also got some tricks for working with the JSON data, please share those as well.

Anonymous ID: 2b1f02 July 7, 2020, 12:58 a.m. No.18815

>>18796

https://www.ctrl.blog/entry/arcane-robotstxt-directives.html

 

Crawl-delay

Sets a delay between each new request to the website. For example, Crawl-delay: 12 tells a crawler to wait 12 seconds between requests, limiting it to no more than five page requests per minute.

 

This directive is recognized by Bing, Yandex (45% market share in Russia and 20% in Ukraine), Naver (40% market share in South Korea), and Mail.Ru (5% market share in Russia).

 

Due to the distributed nature of search crawlers, you may see more requests than expected, as it's unclear whether the limit applies to the entire pool of crawlers or to each individual crawler. Bing specifies that the limit is applied to their entire crawler pool, but none of the other search engines provide any documentation on this.

 

Request-rate

A variation of Crawl-delay that sets the request rate rather than a delay between requests. For example, Request-rate: 5/1m is not equivalent to Crawl-delay: 12, since all five requests could be used up in the first few seconds of the minute. (Request-rate: 1/12s would be equivalent.)
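
Side note: Python's standard library understands both directives, if anyone wants to check a site's limits before crawling (a minimal sketch; example.com stands in for the real host):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.crawl_delay("*"))    # Crawl-delay in seconds, or None
print(rp.request_rate("*"))   # e.g. RequestRate(requests=5, seconds=60), or None
```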

AutoArchiveAnon !!!OGY0YjcwNDFlOTMz ID: 2b1f02 Aug. 11, 2020, 12:53 a.m. No.22498   >>2521

>>22468

On it.

I noticed yesterday that the deserializer I'm using for the JSON file doesn't decode \uXXXX Unicode escapes properly. I need to fix that first so the text isn't garbled. It wouldn't matter much in the Generals, but for country-specific threads it's a problem.
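
Sketched in Python rather than my actual ABAP, the fix looks roughly like this; the tricky part is UTF-16 surrogate pairs, which is exactly what non-Latin text in the country threads produces:

```python
import re

_PAIR = re.compile(r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})")
_SINGLE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_unicode_escapes(s):
    # First combine a high/low surrogate pair into one code point...
    def pair(m):
        hi, lo = int(m.group(1), 16), int(m.group(2), 16)
        return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
    s = _PAIR.sub(pair, s)
    # ...then decode the remaining plain BMP escapes.
    return _SINGLE.sub(lambda m: chr(int(m.group(1), 16)), s)
```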

 

I should be able to make it identify and extract the notables from each bread (always from the follow-up thread) and then create a nice list for me to copy+paste anywhere.

 

I'm ignoring links to threads and only taking links to specific posts. That seemed to work quite nicely, and even if a baker fucks up the links in one thread, the scraper still gets them from the follow-up thread.
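
The extraction itself is little more than a regex over the baker's notables posts. A sketch of the idea (assuming quote links appear as >>NNNN in the post text):

```python
import re

POST_LINK = re.compile(r">>(\d+)")        # link to a specific post
THREAD_LINK = re.compile(r">>>/\w+/\d+")  # cross-board link, ignored

def notable_post_numbers(text):
    # Strip cross-board links first, then collect the post numbers.
    text = THREAD_LINK.sub("", text)
    return [int(n) for n in POST_LINK.findall(text)]
```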

 

>>22477

When I'm done with my collection code, I can either describe the exact logic I'm using on here, or even post the actual code. I'm using SAP / ABAP, so the code might not be that useful: you need a working SAP system up and running to use it. There is a free demo system called NPL, but then you need some SAP knowledge to use it.

AutoArchiveAnon !!!OGY0YjcwNDFlOTMz ID: 2b1f02 Aug. 11, 2020, 12:55 a.m. No.22499

>>22477

From what I heard, wearethene.ws takes the HTML and parses that. I'm using the JSON data instead. JSON works fine, except that sometimes the JSON data is gone. It happened again just now with one thread, but I had archived it before the JSON disappeared. It's really weird.
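
For reference, the JSON sits right next to the HTML. A sketch of the fetch (assuming the usual vichan-style endpoint that 8kun exposes):

```python
import json
import urllib.request

def fetch_thread(board, thread_no):
    # The HTML page lives at /{board}/res/{thread_no}.html;
    # the JSON mirrors it under the same path.
    url = f"https://8kun.top/{board}/res/{thread_no}.json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

# posts = fetch_thread("qresearch", 14193)["posts"]
```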

AutoArchiveAnon !!!OGY0YjcwNDFlOTMz ID: 2b1f02 Aug. 13, 2020, 2:30 a.m. No.22541

>>22532

I'm using an existing library for the parsing, but the library doesn't have \uXXXX support. They fixed that later, but I can't get that patch onto my NPL system, and at work we don't have a 7.5x+ system yet (the patch is for 7.5x+ systems only). The NPL system is 7.52, but it counts as a demo, and it even has some weird restriction in the SQL database so that you can't grow it forever. There is a trick to remove that restriction, though (which I did).

 

The good thing about SAP is that almost everything ships with its source code, so you can check what they are doing (and on normal systems even modify it when there is no other way).

 

The NPL system also runs under Linux (SAP systems are available for Windows as well as other operating systems).

 

My code seems to be working properly. I even checked what 8kun does when JSON-relevant characters appear in a post: those are not encoded as \uXXXX, but as \", \\ and so on. If those characters did come in as \uXXXX, I would still replace them with their escaped forms so that the library doesn't have a problem with them.
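
The replacement step, sketched in Python (the escape table is just standard JSON escaping, nothing 8kun-specific):

```python
# After decoding \uXXXX escapes, re-escape anything that is special in
# JSON so the strict parser downstream doesn't trip over it.
_JSON_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}

def reescape_for_json(s):
    return "".join(_JSON_ESCAPES.get(ch, ch) for ch in s)
```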

 

If my bandwidth were big enough, I could download all videos and upload them somewhere else. An idea would also be to upload them to anonfiles.com. As far as I can remember, they don't delete files unless they really violate the law.

 

Agreed about PDFs. I really don't understand why 8kun accepts PDFs while plain text files are rejected. Plain text is not dangerous at all.

 

I archive PDFs on the Wayback Machine as well. Sadly, archive.is doesn't support PDF archiving, otherwise I would do that too.

 

All threads are archived there though, so if you offer links to the original threads, you can offer links to the archived copies on the Wayback Machine as well as on archive.is. Typically the archive.is copies are better: I haven't once seen pictures missing there, while on the Wayback Machine that happens sometimes.

Anonymous ID: 2b1f02 Sept. 15, 2020, 4:26 a.m. No.25535   >>8764

Can you please change the video URLs to stop going to invidio.us? That instance is down, and you get a selection page which includes jewtube.com, which is kinda ridiculous.

 

You can use

https://invidiou.site/watch?v=XXXX

AutoArchiveAnon !!!OGY0YjcwNDFlOTMz ID: 2b1f02 Sept. 23, 2020, 3:12 a.m. No.27145   >>7146

>>27035

I'm still on it. It got complicated because of the JSON decoding, which is required for properly extracting notables, plus there is also a problem right now with archive.today archiving. That site has had a captcha active since the end of last month, and I'm still looking for a way around it. Strangely, the same code works fine on Windows, but I get the captcha on Linux. It may have to do with a slightly different build of curl.
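
For context, the submit call I'm fighting with looks roughly like this (a sketch; the endpoint and form field are my reading of the site's submit form, not a documented API, and the captcha still has to be dealt with):

```python
import urllib.parse
import urllib.request

def submit_to_archive_today(url):
    data = urllib.parse.urlencode({"url": url}).encode()
    req = urllib.request.Request(
        "https://archive.ph/submit/",  # one of the archive.today mirrors
        data=data,
        headers={"User-Agent": "Mozilla/5.0"},  # browser-like UA
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.geturl()  # snapshot URL after redirects
```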

Anonymous ID: 2b1f02 Jan. 4, 2021, 7:08 a.m. No.44457   >>4459

>>44202

>I offered a way for those not knowing, while the scraper was down, to view current notables

Yes, and there is a thread for exactly that purpose, so you can easily go through all the notables in a row.

 

The thread is even filled by another scraper that runs every few hours and also archives the threads + contents.

 

And I'm pretty sure wearethene.ws wasn't even really down with the scraper not running. They have consistency checks in there, and because of all these stupid bakery fights, bread numbers mismatched several times in the last few days, which makes the scraper go "wtf" until someone goes in manually and fixes it.

 

Maybe these bakery fights are even about creating that chaos, to stop the scrapers from working.

Some baker even went to the notable thread yesterday and tried to create chaos in there as well.

 

Anyone on here should know that you can also visit the current bread for the very latest notables, if the baker actually did a proper job (and if there even is a baker in the first place; several breads yesterday were e-bakes).

Anonymous ID: 2b1f02 Jan. 4, 2021, 7:14 a.m. No.44458

>>44402

>you can count on the site โ€œhaving issuesโ€ when itโ€™s crunch time

 

Blame the silly baker fights.

Blame the bakers who cause chaos intentionally.

Weird that when shit gets real, there are these fights causing all sorts of problems.

 

I mean, just look at the notables from that baker alone. It's retarded. I can't decide if it's someone who went full schizo, or if it's a shill.

Anonymous ID: 2b1f02 Jan. 7, 2021, 2:58 a.m. No.44536

>>44459

>"anon notable thread"

There is no anon notable thread.

 

It's a collection thread for official baker notables and that's what is done in there. Collecting baker notables. Nothing more, nothing less.

Similar to what wearethene.ws is doing.

 

>There are no "bakery fights."

Yes, there are.

 

>helped by both shills and crazies

Who are you to decide who is who?

Maybe you are the shill, I don't know.

Fact is, there are bakery fights, with bread hijacks, duplicate breads and other shit, and that sucks.

 

Maybe that's because one side thinks the other side is shills + crazies. Maybe both sides think that about each other. Maybe both sides are crazies. It's not for me to decide who is who. I do think the side that spammed the official notable collection thread with nonsense and bullshit is crazy and wants to create chaos.

Anonymous ID: 2b1f02 Jan. 7, 2021, 2:59 a.m. No.44537   >>4571

>>44464

>3. Or you can enjoy what you have been given for free, using an enormous amount of time that could have been spent doing other things.

 

You are doing God's work.

Anyone complaining can simply create something similar/better. It's not your fault that there are bakery fights.

God bless.

Anonymous ID: 2b1f02 Feb. 1, 2021, 3:18 a.m. No.45171

There seems to be a bug with certain pictures.

 

For example, notable 12786994.

Wearethene.ws shows a large X, although the post has a working .png attached.

 

Maybe there is a problem downloading certain pictures, idk.

Anonymous ID: 2b1f02 Feb. 25, 2021, 2:22 a.m. No.46240

>>46194

No, wearethene.ws is catching notables on the /qresearch/ board.

 

If you have valuable news, just post it in the General thread. The baker may then mark it as notable, and it will show up on wearethene.ws.

 

It's a good idea to archive the news first, using either the Wayback Machine or archive.is.

 

It's also a good idea to take a screenshot on top of that and post all of these together.

 

You can make screenshots of webpages using:

https://www.site-shot.com/

That works with most pages; there are also browser plugins for Pale Moon or Firefox, for example "FireShot".