Anonymous ID: d6b0f8 March 6, 2018, 11:11 a.m. No.569596   🗄️.is 🔗kun

>>494745

I have created a searchable application for /qresearch/.

 

The database is filling right now. I kept only the image attachments in order to save hard disk space.

 

At present 52,000 of the most recent posts on qresearch are loaded in the table with the attachments. We'll see how the storage works out.

 

I'll advise when anons can attempt to use the system.

Anonymous ID: d6b0f8 Searchability March 14, 2018, 12:51 p.m. No.664928   🗄️.is 🔗kun

>>494745

Searchable Qresearch

www.pavuk.com

username: qanon

password: qanon

 

Updated regularly with the messages and images from Qresearch general.

Anonymous ID: d6b0f8 March 14, 2018, 12:56 p.m. No.664975   🗄️.is 🔗kun

I'm using the 8chan JSON API endpoints. I still need to pull from the archive.json file downloaded yesterday.
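
A minimal sketch of what reading thread numbers out of a catalog.json response might look like. The layout assumed here (a list of pages, each with a "threads" list whose entries carry "no" and "replies") is the vichan-style catalog format. The field names, the helper name list_threads and the sample payload are assumptions for illustration, not the importer's actual code.

```python
import json

def list_threads(catalog_text):
    """Return (thread_no, reply_count) pairs from a catalog.json payload."""
    pages = json.loads(catalog_text)
    found = []
    for page in pages:
        # Each page is assumed to hold a "threads" list.
        for thread in page.get("threads", []):
            found.append((thread["no"], thread.get("replies", 0)))
    return found

# Hypothetical sample payload, not real board data.
sample = json.dumps([
    {"page": 0, "threads": [{"no": 664000, "replies": 751}]},
    {"page": 1, "threads": [{"no": 663100, "replies": 12}]},
])
print(list_threads(sample))
```

In a live import the payload would come from an HTTP fetch of the board's catalog.json rather than from a literal string.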

 

My server is on a linode so I have fast response time.

Anonymous ID: d6b0f8 March 14, 2018, 12:58 p.m. No.664990   🗄️.is 🔗kun   >>6471

You can search the text in the posts with wildcards. Say you want all posts with the word BOOM. Just enter boom.

 

Say you want the posts from Q with his tripcode and "boom"

 

Put !UW.yye1fxo in the trip code field.

Put boom in the comment field.

Click the search button.

 

voila.
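
The steps above amount to matching on two fields at once. Here is a rough sketch of that logic, assuming a case-insensitive substring match on the comment and an exact match on the trip code. The function name matches, the dictionary fields and the sample posts are all hypothetical. Pavuk builds its own query, and this only illustrates the idea.

```python
def matches(post, trip=None, comment=None):
    """True when every supplied criterion matches the post."""
    if trip is not None and post.get("trip") != trip:
        return False
    if comment is not None:
        # Case-insensitive substring match: "boom" finds "BOOM".
        if comment.lower() not in post.get("comment", "").lower():
            return False
    return True

# Made-up sample rows, not real posts.
posts = [
    {"no": 1, "trip": "!UW.yye1fxo", "comment": "BOOM sample text"},
    {"no": 2, "trip": None, "comment": "another boom sample"},
    {"no": 3, "trip": "!UW.yye1fxo", "comment": "unrelated sample"},
]
hits = [p["no"] for p in posts if matches(p, trip="!UW.yye1fxo", comment="boom")]
print(hits)
```

Leaving a criterion as None behaves like leaving that form field empty: it places no restriction on the result.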

Anonymous ID: d6b0f8 March 14, 2018, 1:06 p.m. No.665054   🗄️.is 🔗kun

U.POSTS.NEW is the new-format table.

U.POSTS.NEW.ATT is the table of attachments for the primary table. Each one is a link to a binary.

Anonymous ID: d6b0f8 March 14, 2018, 4:45 p.m. No.666959   🗄️.is 🔗kun

>>666471

I've not been back into this thread for a while. I'm running the qresearch import process to get up-to-date. One technique that is needed is to re-scan already imported threads for posts missed during initial scans.

 

Threads are imported from the catalog.json file. In this state, we know the thread number and the number of messages at that time. The only time we know a thread is closed is when the number of posts >= the number in the official "bake" count.

 

Therefore, my program keeps testing until the posts counter >= the bake counter and then marks the thread as complete in the thread table. This then prevents re-scanning all threads because we get only the open ones.
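
A sketch of that loop, under stated assumptions: the thread records, the bake_count value of 751 and the fetch_post_count callback are all hypothetical stand-ins for the real thread table and the API call.

```python
def rescan(threads, fetch_post_count):
    """Re-check only open threads; mark complete once posts >= bake count."""
    for thread in threads:
        if thread["complete"]:
            continue  # closed threads are never fetched again
        thread["posts"] = fetch_post_count(thread["no"])
        if thread["posts"] >= thread["bake_count"]:
            thread["complete"] = True
    # Return the thread numbers that still need another pass.
    return [t["no"] for t in threads if not t["complete"]]

# Hypothetical thread table rows.
threads = [
    {"no": 100, "posts": 751, "bake_count": 751, "complete": True},
    {"no": 200, "posts": 300, "bake_count": 751, "complete": False},
    {"no": 300, "posts": 700, "bake_count": 751, "complete": False},
]
live_counts = {200: 420, 300: 751}  # stand-in for what the API would report
still_open = rescan(threads, lambda no: live_counts[no])
print(still_open)
```

Because completed threads are skipped before any fetch, each pass only costs one request per open thread.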

 

Multiple scans of posts are needed to get all of them and to deal with duplicate threads.

 

I use the 8-chan post number as part of the primary key to the threads and posts tables.

 

8GA_1 is 8chan Great Awakening post 1

8QR_655000 is 8chan Qresearch post 655000
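
A small sketch of how keys in that format could be built and taken apart. The board names used as dictionary keys and the helper names make_key and parse_key are assumptions. Only the 8GA_ and 8QR_ prefixes come from the examples above.

```python
# Board-name-to-prefix mapping; the board names are assumed for illustration.
BOARD_PREFIXES = {"greatawakening": "8GA", "qresearch": "8QR"}

def make_key(board, post_no):
    """Build a primary key such as 8QR_655000."""
    return f"{BOARD_PREFIXES[board]}_{post_no}"

def parse_key(key):
    """Split a primary key back into (prefix, post number)."""
    prefix, _, number = key.partition("_")
    return prefix, int(number)

# Storing by key means a re-scanned post overwrites itself instead of duplicating.
store = {}
store[make_key("qresearch", 655000)] = "first scan"
store[make_key("qresearch", 655000)] = "second scan"

print(make_key("qresearch", 655000))
print(parse_key("8GA_1"))
print(len(store))
```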

 

The big problem is going back to find threads BEFORE the last 25 pages in the catalog.json. Therefore, I can't get anything earlier than when I first wrote the import.

Anonymous ID: d6b0f8 Timestamps March 14, 2018, 4:47 p.m. No.666983   🗄️.is 🔗kun   >>7927

The import routine uses the JSON API endpoint from the boards. In the JSON is the Unix timestamp of the message. This is a native field/object type in Pavuk. Thus all timestamps are set to UTC internally.
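
For anyone who wants to see the conversion, here is a minimal sketch using Python's standard library. The function name to_utc is an assumption, and the sample value is simply midnight UTC on March 14, 2018.

```python
from datetime import datetime, timezone

def to_utc(unix_seconds):
    """Convert a Unix timestamp (seconds since 1970-01-01) to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

stamp = to_utc(1520985600)
print(stamp.isoformat())  # 2018-03-14T00:00:00+00:00
```

Passing tz=timezone.utc keeps the result independent of the server's local time zone.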

 

NOW, if I could get DJT's Twitter feed in JSON, it also has UnixTime and this goes in directly.

 

Twitter wants me to give them all sorts of documentation before they will allow me to use their API. Frankly, I don't have the time to deal with them or the inclination.

Anonymous ID: d6b0f8 Other boards March 14, 2018, 4:48 p.m. No.666995   🗄️.is 🔗kun   >>7971 >>8375 >>0221 >>0322

I can get other boards provided the endpoints are similar and that the catalog.json file still has links to the threads.

 

BO has never responded to my requests on how to get older threads.

Anonymous ID: d6b0f8 Searching with Pavuk March 14, 2018, 4:50 p.m. No.667022   🗄️.is 🔗kun

Super simple.

 

Entry forms are also search forms.

Enter the data that you wish to match.

Click the search button.

 

Pavuk creates and then executes the appropriate query and returns the items in a Kendo grid. Scroll, resort, export to excel or click on a row to return to the entry form with your data.

 

Searching on timestamps has issues that I need to resolve.

Anonymous ID: d6b0f8 COMMENTS scrubbed with Lynx March 14, 2018, 4:57 p.m. No.667075   🗄️.is 🔗kun   >>7776

The comments from the JSON API include markup and JS to go to real links. This is a problem with the storage and search. I pipe the comment string through Lynx with the -dump option and this gives me clean text in STDOUT and then a separator and then the list of actual links. I put the text in the comments and the links in a multivalue table. I'll expose the links tomorrow as a separate tab in the entry form.
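
A rough sketch of that pipeline, assuming Lynx is installed and that its dump ends with a "References" section of numbered links. The function names split_dump and scrub and the sample text are hypothetical. The real import is not written in Python.

```python
import subprocess

def split_dump(dump_text):
    """Split `lynx -dump` output into (clean_text, [links])."""
    head, marker, refs = dump_text.rpartition("\nReferences\n")
    if not marker:
        return dump_text.strip(), []  # no link list in this dump
    links = []
    for line in refs.splitlines():
        number, dot, url = line.strip().partition(". ")
        if dot and number.isdigit():  # keep only "  1. http://..." style rows
            links.append(url.strip())
    return head.strip(), links

def scrub(html):
    """Pipe an HTML fragment through Lynx; requires the lynx binary."""
    result = subprocess.run(
        ["lynx", "-dump", "-stdin", "-force_html"],
        input=html, capture_output=True, text=True, check=True,
    )
    return split_dump(result.stdout)

# Hand-written sample shaped like a Lynx dump, so this runs without Lynx.
sample_dump = (
    "See [1]this and [2]that.\n\n"
    "References\n\n"
    "   1. https://example.com/a\n"
    "   2. https://example.com/b\n"
)
text, links = split_dump(sample_dump)
print(text)
print(links)
```

The text half would go in the comment field and each entry of links would become one value in the multivalue table.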

Anonymous ID: d6b0f8 Import procedure debugging view March 15, 2018, 5:47 a.m. No.672305   🗄️.is 🔗kun   >>0213

I can get the other boards and other threads, the issue is disk storage. Linode gives me a lot of bandwidth, but only a few gigs of disk until I change my plan with them.

Anonymous ID: d6b0f8 Limits March 15, 2018, 5:53 a.m. No.672334   🗄️.is 🔗kun   >>0263

The limit of an OpenQM hash file (table) is 16TB. When this becomes a problem, I can create a distributed file (table) split by primary key. Say, put all 8QR in one portion and 8GA in another. It is simply a way to control how physical storage is allocated.

 

Pavuk session records are keyed by GUIDs. (Don't worry, I'll purge anons out of the storage.) It was done because of commercial requirements for SOX and other audit compliance issues. Remember, I created Pavuk to build commercial apps.

 

The distributed file is built by using the first 2 hex characters of the GUID in the primary key. Thus, it has component files:

 

00

01

...

FE

FF

 

Or 256 parts.

 

Theoretical table size:

256 x 16TB = 4096TB
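
A sketch of the partitioning arithmetic, assuming the GUID is in its usual text form so that the first two characters are hex digits. The function name part_for and the sample GUID are made up for illustration. OpenQM handles the real distribution itself.

```python
import uuid

def part_for(guid):
    """Name of the component file: the first two hex characters, uppercased."""
    return str(guid)[:2].upper()

# Two hex characters give 16 * 16 = 256 possible component files.
all_parts = [f"{n:02X}" for n in range(256)]
print(len(all_parts), all_parts[0], all_parts[-1])

# Hypothetical GUID, chosen so it lands in the FE part.
print(part_for(uuid.UUID("fe3a1c9e-0000-4000-8000-000000000000")))

# Theoretical capacity at 16 TB per part.
print(256 * 16, "TB")
```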

 

www.openqm.com

Anonymous ID: d6b0f8 No JSON for older threads :( March 15, 2018, 6:30 a.m. No.672572   🗄️.is 🔗kun

Brother Anons, I can find the IDs of the threads by using the search function on Archive.is. For example, research general #2 was post number 799. Once I know this, I can go back to 8chan and pull up the thread.

 

Sadly, I cannot get it with JSON. I only can get HTML. This means parsing the HTML.

 

This requires a new string parser. The parsed posts go into the same table as the JSON data, just with more work. Here's what the posts look like in HTML:

Anonymous ID: d6b0f8 Crowdfunding more resources March 15, 2018, 6:40 a.m. No.672659   🗄️.is 🔗kun   >>2709

I've put out a tweet thread showing the progress and asking if someone will step up to help lead a crowdfunding campaign so I can afford a bigger Linode.

Anonymous ID: d6b0f8 March 15, 2018, 7:11 a.m. No.672886   🗄️.is 🔗kun   >>2980 >>4536

>>672709

I was just going to have folks send to my personal paypal account since I'm funding the site anyway. You can set up a regular monthly payment. I do that with others like Stefan Molyneux where we send $10/month.

Anonymous ID: d6b0f8 IF I GET HELP FUNDING... March 15, 2018, 7:13 a.m. No.672908   🗄️.is 🔗kun   >>4321 >>0546

We need to work together to get all of the data into the database. If someone could help with a Twatter feed from DJT - preferably raw and in JSON, that can be added to the posts table.

Anonymous ID: d6b0f8 March 15, 2018, 7:15 a.m. No.672920   🗄️.is 🔗kun

>>672688

That was helpful. I would ask people in this thread to help develop the information model.

 

There is a "boards" table with the links to get data for each type. It can be expanded into which boards are archived where and I can automate the pulls.

Anonymous ID: d6b0f8 Pavuk Searchable March 26, 2018, 5:25 p.m. No.803777   🗄️.is 🔗kun   >>5300 >>9411

Linode is telling me that I can get block storage, but only by migrating my VM to the Fremont data center, which means getting a new IP address (and a new SSL cert, etc.).

 

Crickets from followers whom I've asked to donate funds for the added expenses.

Anonymous ID: d6b0f8 March 27, 2018, 6:23 a.m. No.809001   🗄️.is 🔗kun   >>9048

>>805300

What other options?!?!

"Archive EVERYTHING OFF LINE"

"MAKE IT SEARCHABLE"

 

If I don't have enough storage, where am I going to store the data?

 

If you don't know about IT, you should not be in this thread.