We might be able to put the GPU to some use. The decoding part obviously has too much conditional branching for the GPU to help there. But the permutation generation step is highly linear, so it should be well suited to parallelization. The GPU could be sent prospective passwords and a sizeN, and send back the resulting arrays. However, it would be memory bound, and the huge bandwidth required to send those arrays back to main memory might be an issue.
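To make the idea concrete, here is a rough CPU-side sketch in C of the shape such a kernel could take: one independent job per prospective password, each writing a sizeN-entry array into its own slot of an output buffer. Everything in it is an assumption for illustration. It assumes the permutation is a Fisher-Yates shuffle, `next_u64` is a stand-in generator (splitmix64) rather than SecureRandom, `seed_from_password` is a made-up FNV-1a seed derivation, and all function names are hypothetical. It will not reproduce the real permutations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in generator (splitmix64). This is NOT SecureRandom; a faithful
   port of SecureRandom would have to replace it. */
static uint64_t next_u64(uint64_t *state) {
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Hypothetical seed derivation: FNV-1a hash of the password bytes. */
static uint64_t seed_from_password(const char *pw) {
    uint64_t h = 0xCBF29CE484222325ULL;
    for (; *pw; ++pw) {
        h ^= (unsigned char)*pw;
        h *= 0x100000001B3ULL;
    }
    return h;
}

/* Fill out[0..n-1] with a permutation of 0..n-1 (Fisher-Yates shuffle).
   The loop has no data-dependent branches, only index arithmetic and swaps. */
void permutation_for_password(const char *pw, uint32_t n, uint32_t *out) {
    assert(n > 0 && out != NULL);
    uint64_t state = seed_from_password(pw);
    for (uint32_t i = 0; i < n; ++i) {
        out[i] = i;
    }
    for (uint32_t i = n - 1; i > 0; --i) {
        /* The modulo introduces a slight bias; acceptable for a sketch. */
        uint32_t j = (uint32_t)(next_u64(&state) % (i + 1));
        uint32_t tmp = out[i];
        out[i] = out[j];
        out[j] = tmp;
    }
}

/* Batch driver: every iteration is independent, so on a GPU each one
   could be a separate thread writing to its own slice of `out`. */
void permutations_batch(const char *const *pws, size_t count,
                        uint32_t n, uint32_t *out) {
    for (size_t k = 0; k < count; ++k) {
        permutation_for_password(pws[k], n, out + k * (size_t)n);
    }
}

/* Bytes that would have to be copied back to main memory for a batch. */
uint64_t result_bytes(uint64_t count, uint64_t n) {
    return count * n * sizeof(uint32_t);
}
```

The last function is where the bandwidth worry shows up. With made-up numbers, a batch of 1,000,000 passwords at a sizeN of 1024 and 4-byte entries comes to about 4.1 GB of results to copy back, so consuming the arrays on the GPU or returning something smaller would probably matter more than the shuffle itself.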
I found the source for all the parts of SecureRandom and plan on making a perfect replica of it in C as a stepping stone to a possible GPU implementation. That is extremely ambitious for someone with my coding skill level. But I can do it… eventually.