dChan - Q Origins Project Archive

I did some quick testing with pytesseract and had mixed results. I didn't spend much time trying to tweak it. I tried running it without any training, then with training data for the Impact font (it seems a lot of meme images use that). I didn't get a clean output either way.

For anyone interested in trying this, you will need to install Docker.

After you have Docker installed:

Copy into Dockerfile:

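# Official Python base image (Debian-based, so apt-get is available)
FROM python:3.7

# Install the Tesseract OCR engine that pytesseract shells out to
RUN apt-get update \
    && apt-get install -y tesseract-ocr

WORKDIR /app
COPY . /app

# Install the Python dependencies listed in requirements.txt
RUN pip install -r requirements.txt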

Copy into requirements.txt:

pillow
pytesseract
opencv-python

Copy into ocr1.py:

from PIL import Image
import pytesseract

# Run Tesseract on one image and print the recognized text
print(pytesseract.image_to_string(Image.open('memes/test1.jpeg')))

Then build the docker image:

docker build . -t ocr:latest

Then run the docker image and mount your code directory (this will allow you to make code changes without having to rebuild the image):

docker run -v ~/code/pytesseract:/app -it ocr:latest bash

NOTE: My code lives in ~/code/pytesseract; change that path if yours is somewhere else.

This will bring you into your docker container and you can run your code in there:

python ocr1.py

Because the code directory is mounted, images under it are visible inside the container. I keep mine in a 'memes' subdir; any PNG or JPG you copy there can be read by the OCR script.
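
If you want to run every image in that subdir instead of editing the filename in ocr1.py each time, something like the following might work. This is only a rough sketch: the function names and the list of extensions are made up for illustration, and it uses the same pytesseract.image_to_string call as ocr1.py.

```python
from pathlib import Path

# Extensions treated as images; an assumption, adjust to taste.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_images(directory):
    """Return image files directly under `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def ocr_image(path):
    """Run Tesseract on one image, the same call ocr1.py makes."""
    # Imported lazily so find_images() works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def ocr_directory(directory, ocr=ocr_image):
    """Map each image file name under `directory` to its OCR text."""
    return {path.name: ocr(path) for path in find_images(directory)}


if __name__ == "__main__":
    memes = Path("memes")
    if memes.is_dir():
        for name, text in ocr_directory(memes).items():
            print(f"--- {name} ---")
            print(text)
    else:
        print("no memes/ directory found here")
```

Saved next to ocr1.py under a name of your choosing (ocr_all.py, say), it would be run inside the container the same way, with python ocr_all.py. The ocr argument is there so a different OCR function can be swapped in without touching the directory walk.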

This file:

http://ahijackedlife.com/wp-content/gallery/qanon-memes-3f/Anderson-its-OK.jpg

resulted in output:

ANDERSON, IT'S

WE GOT THIS= WELL

IS A"CONSPIRACY THEORY”

And with the Impact font training data, the output was:

ANpERSoNl lTS

YGoNEBEoK

WEGoTATHlSNWEfLL

TELL lEM THAT THE sToRM

lsAUcoNSPlRAcv THEonU
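
To put a number on "clean" rather than eyeballing it, one option is to compare each OCR result against text you typed by hand from the image. The sketch below uses difflib from the standard library. The helper names are hypothetical, and since there is no hand-typed reference for the meme here, the default model's first line stands in for one purely to show the call.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Uppercase the text and collapse whitespace runs to single spaces."""
    return " ".join(text.upper().split())


def similarity(ocr_text, reference):
    """Return a ratio from 0.0 to 1.0 of how closely the two texts match."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(reference)).ratio()


# First line of each result above. The "reference" is a stand-in,
# not a verified transcription of the image.
reference_line = "ANDERSON, IT'S"
impact_line = "ANpERSoNl lTS"

print(round(similarity(reference_line, reference_line), 2))
print(round(similarity(impact_line, reference_line), 2))
```

Case is folded before comparing because meme text is usually all caps, so a model that gets the letters right but the case wrong is not penalized. Drop the upper() call if case matters for your use.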

Larger example here:

https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/

Free online font training:

http://trainyourtesseract.com/

If you train fonts, copy the trained data into your code directory on the host, then, inside the container, copy it to the Tesseract data directory:

cp training/Impact.traineddata /usr/share/tesseract-ocr/4.00/tessdata/.
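
Once the file is in the tessdata directory, Tesseract still has to be told to use it. With pytesseract that is the lang argument of image_to_string, and the language name is the file name without the .traineddata extension. A minimal sketch, assuming the Impact.traineddata file and memes/test1.jpeg image from above; the two function names are made up for illustration:

```python
from pathlib import Path


def lang_from_traineddata(path):
    """Derive the Tesseract language name: Impact.traineddata -> 'Impact'."""
    path = Path(path)
    if path.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {path}")
    return path.stem


def ocr_with_font(image_path, traineddata_path):
    """OCR one image with the model named after `traineddata_path`."""
    # Imported lazily so the helper above works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    lang = lang_from_traineddata(traineddata_path)
    with Image.open(image_path) as img:
        # pytesseract passes `lang` to the tesseract binary as its -l option.
        return pytesseract.image_to_string(img, lang=lang)


if __name__ == "__main__":
    image = Path("memes/test1.jpeg")
    if image.exists():
        print(ocr_with_font(image, "training/Impact.traineddata"))
```

Only the file name is used here to work out the language. Tesseract itself loads the copy that sits in its tessdata directory, so the cp step above still has to be done first.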

I hope this is readable!