Mining information from PDF URLs using regex

I’m trying to use a Python script in OpenRefine via Edit Column > Add column based on this column (Jython/Python)… where the column has links to PDFs. The script I have is:

import re
import pdftotext
from urllib.request import urlopen

target_url = value
file = urlopen(target_url)
pdf = pdftotext.PDF(file)

# Match the regular expression against the contents of the PDF file

pattern = '\d\d.\d+/\S*'
matches = re.findall(pattern, "\n\n".join(pdf))
output = ';'.join(matches)
return output

Here is an example:
PDF link https://jnnp.bmj.com/content/jnnp/91/8/795.full.pdf

Any ideas on what I need to correct for this script to work? Or maybe there is a more efficient way to open a PDF URL and match patterns for inclusion in my OpenRefine project? Thanks!

There might be a more competent answer coming…

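But as far as I know, making external Python libraries like pdftotext work with OpenRefine 3 involves quite a lot of hacking. OpenRefine runs these scripts on Jython 2.7, which cannot load C-extension packages such as pdftotext, and urllib.request only exists in Python 3. See my answer on Stack Overflow on a similar topic.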

A common way to circumvent having to hack Jython is to wrap an external Python library via FastAPI or Typer and then delegate the calls from OpenRefine via HTTP requests or CLI calls.

Here is a small gist showing how to perform Named Entity Recognition with spaCy and OpenRefine using FastAPI:

Also check my comment for a simplified version that you could use as a starting template for a pdftotext service.
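
For illustration, here is a rough sketch of what such a pdftotext service might look like. It is not the code from the gist: the file name pdf_service.py, the /dois endpoint, and the helper names find_dois and create_app are all made up for this example, and it assumes fastapi, uvicorn and pdftotext are installed in a regular Python 3 environment outside OpenRefine. The regex is the one from the question, with the dot escaped so it only matches a literal ".".

```python
# pdf_service.py (hypothetical file name)
import io
import re
from urllib.request import urlopen

# Pattern from the question, with the dot escaped to match a literal "."
DOI_PATTERN = re.compile(r"\d\d\.\d+/\S*")


def find_dois(text):
    """Return every DOI-like string found in the given text."""
    return DOI_PATTERN.findall(text)


def create_app():
    """Build the FastAPI app (hypothetical layout, one endpoint)."""
    # Imported here so find_dois stays usable without these packages.
    import pdftotext
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/dois")
    def dois(url: str) -> dict:
        # Download the PDF into memory and hand it to pdftotext.
        with urlopen(url) as response:
            pdf = pdftotext.PDF(io.BytesIO(response.read()))
        # pdftotext.PDF yields one string per page.
        return {"dois": find_dois("\n\n".join(pdf))}

    return app
```

Assuming the file is saved as pdf_service.py, it could be started with: uvicorn pdf_service:create_app --factory
In OpenRefine you would then use Edit column > Add column by fetching URLs with a GREL expression along the lines of "http://localhost:8000/dois?url=" + escape(value, "url"), and turn the JSON response into a single cell with something like value.parseJson().dois.join(";"). Some publishers reject requests that do not come from a browser, so the download step may need extra headers.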
