>Here is python source code to extract all the words from fauci's email drop
That Python code is using the Natural Language Toolkit (NLTK).
'punkt' is NLTK's sentence tokenizer: it splits text into sentences.
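
For anyone unsure what a tokenizer does in general, here is a rough sketch. It is a plain-regex stand-in, not punkt itself and not the code from the quoted post; `split_sentences` and `split_words` are names made up for illustration.

```python
import re

def split_sentences(text):
    # Hypothetical helper: naive split after ., ! or ? followed by whitespace.
    # This is an assumption-level stand-in, not how punkt decides boundaries.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Hypothetical helper: runs of letters/digits/apostrophes,
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)

sample = "The meeting is at noon. Please confirm!"
sentences = split_sentences(sample)
tokens = [split_words(s) for s in sentences]

print(sentences)
print(tokens)
```

Real punkt uses a trained model to decide where sentences end (for example around abbreviations) instead of a fixed regex like this one.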
Basically, the code is using probability to choose the most likely words to fit around the non-redacted words in the PDF.
It is not extracting words; it is using probability to fill in the blanks.
Like throwing hotdogs down a hallway while riding a merry-go-round.
But the hotdogs would be more fun and accurate.