Mining information from PDF URLs using regex

There might be a more competent answer coming…

But as far as I know, making external Python libraries like pdftotext work with OpenRefine 3 involves quite a lot of hacking. See my answer on Stack Overflow on a similar topic.

A common way to avoid hacking Jython is to wrap the external Python library in a small web service (e.g. with FastAPI) or a command-line tool (e.g. with Typer) and then delegate the calls from OpenRefine via HTTP requests or CLI calls.
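To make the wrapping idea concrete, here is a rough sketch of such a service. It is not the gist mentioned in this answer: to keep it free of web dependencies it uses the standard library's `http.server` instead of FastAPI, and it assumes the `pdftotext` package is installed. The names `pdf_url_to_text`, `find_matches`, `ExtractHandler`, the `url` and `pattern` query parameters, the `--serve` flag and port 8000 are all made up for illustration.

```python
import io
import json
import re
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen


def pdf_url_to_text(pdf_url):
    """Download a PDF and return its text, pages separated by blank lines."""
    # Assumption: `pip install pdftotext` has been run. Imported lazily so
    # the rest of the module still loads when the package is missing.
    import pdftotext

    with urlopen(pdf_url, timeout=30) as response:
        pdf = pdftotext.PDF(io.BytesIO(response.read()))
    return "\n\n".join(pdf)


def find_matches(text, pattern):
    """Return every non-overlapping match of the regex `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


class ExtractHandler(BaseHTTPRequestHandler):
    """Answers GET /?url=<pdf url>[&pattern=<regex>] with JSON."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        pdf_url = query.get("url", [""])[0]
        pattern = query.get("pattern", [""])[0]
        try:
            text = pdf_url_to_text(pdf_url)
            # With a pattern, return only the matches; otherwise the full text.
            payload = {"matches": find_matches(text, pattern)} if pattern else {"text": text}
            status = 200
        except Exception as exc:  # report any failure to the caller as JSON
            payload, status = {"error": str(exc)}, 500
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Run the service locally until interrupted."""
    HTTPServer((host, port), ExtractHandler).serve_forever()


# Start with: python pdf_service.py --serve  (the file name is hypothetical)
if __name__ == "__main__" and "--serve" in sys.argv[1:]:
    serve()
```

With the service running, something like "Add column by fetching URLs" in OpenRefine with a GREL expression along the lines of `"http://127.0.0.1:8000/?url=" + escape(value, "url")` should return the JSON per row, which you can then pick apart with `parseJson()`. Doing the regex on the OpenRefine side instead of passing `pattern` works just as well; the service then only has to return the text.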

Here is a small gist showing how to perform Named Entity Recognition with spaCy and OpenRefine using FastAPI:

Also check my comment for a simplified version that you could use as a starting template for a pdftotext service.