Mining information from PDF URLs using regex

Chris_Erdmann · December 19, 2022, 3:05am

I'm trying to use a Python script in OpenRefine via Edit Column > Add column based on this column (Jython/Python)... where the column has links to PDFs. The script I have is:

import re

import pdftotext
from urllib.request import urlopen

target_url = value
file = urlopen(target_url)
pdf = pdftotext.PDF(file)

# Match the regular expression against the contents of the PDF file
pattern = r'\d\d.\d+/\S*'
matches = re.findall(pattern, "\n\n".join(pdf))
output = ';'.join(matches)
return output

Here is an example:
PDF link https://jnnp.bmj.com/content/jnnp/91/8/795.full.pdf

Any ideas on what I need to correct for this script to work? Or maybe there is a more efficient way to open a PDF URL and match patterns for inclusion in my OpenRefine project? Thanks!

b2m · December 19, 2022, 7:19am

There might be a more competent answer coming…

But as far as I know, making external Python libraries like pdftotext work with OpenRefine 3 involves quite a lot of hacking. See my answer on Stack Overflow on a similar topic.

A common way to circumvent having to hack Jython is to wrap an external Python library via FastAPI or Typer and then delegate the calls from OpenRefine via HTTP requests or CLI calls.

Here is a small gist showing how to perform Named Entity Recognition with spaCy and OpenRefine using FastAPI:

https://gist.github.com/b2m/6e2697ce182548a98320e4b7b7b885b6

ner-service.py

from typing import List

import spacy
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(
    title="NER service based on spaCy",
    description="""

(The file is truncated here; see the gist linked above for the full source.)

requirements.txt

fastapi==0.67.0
pydantic==1.8.2
python-multipart==0.0.5
spacy==3.1.1
uvicorn==0.14.0

Also check my comment for a simplified version, that you could use as a starting template for a pdftotext service.
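
As a rough illustration of what such a pdftotext service could look like, here is a minimal sketch. It is not the template from the gist comment. The file name pdf_service.py, the /matches endpoint, and the helper names find_matches, pdf_text_from_url and create_app are all hypothetical, and the regular expression is simply the one from the question. The web and PDF libraries are imported inside the functions so that the regex helper can be tried without them installed.

```python
import io
import re
from urllib.request import urlopen

# Same pattern as in the question, written as a raw string.
PATTERN = re.compile(r"\d\d.\d+/\S*")


def find_matches(text):
    """Return all pattern matches found in text, joined with ';'."""
    return ";".join(PATTERN.findall(text))


def pdf_text_from_url(url):
    """Download a PDF and return the text of all pages as one string."""
    # Imported lazily: pdftotext is an external package built on poppler.
    import pdftotext

    with urlopen(url) as response:
        # Buffer the download so pdftotext gets a regular binary file object.
        data = io.BytesIO(response.read())
    return "\n\n".join(pdftotext.PDF(data))


def create_app():
    """Build the FastAPI app exposing a (hypothetical) /matches endpoint."""
    # Imported lazily so the helpers above work without the web stack.
    from fastapi import FastAPI
    from fastapi.responses import PlainTextResponse

    app = FastAPI(title="PDF regex service")

    @app.get("/matches", response_class=PlainTextResponse)
    def matches(url: str):
        # 'url' arrives as a query parameter, e.g. /matches?url=https://...
        return find_matches(pdf_text_from_url(url))

    return app
```

Assuming the file is saved as pdf_service.py, it could be started with `uvicorn pdf_service:create_app --factory`. On the OpenRefine side, "Add column by fetching URLs" with a GREL expression along the lines of `"http://localhost:8000/matches?url=" + escape(value, "url")` should then fill the new column with the matches. Host and port here are uvicorn's defaults and may differ in your setup.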
