Mining information from PDF URLs using regex

b2m · December 19, 2022, 7:19am

There might be a more competent answer coming…

But as far as I know, making external Python libraries like pdftotext work with OpenRefine 3 involves quite a lot of hacking. See my answer on Stack Overflow on a similar topic.

A common way to circumvent having to hack Jython is to wrap an external Python library via FastAPI or Typer and then delegate the calls from OpenRefine via HTTP requests or CLI calls.
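To make the HTTP variant concrete, here is a minimal sketch of such a wrapper. To stay free of extra dependencies it uses Python's standard library `http.server` instead of FastAPI, so treat it as an illustration of the pattern rather than the gist's code. The function `find_matches`, the query parameters `text` and `pattern`, and port 8000 are placeholders chosen for this example.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def find_matches(text, pattern):
    """Return every non-overlapping match of `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


class Handler(BaseHTTPRequestHandler):
    """Answers GET /?pattern=...&text=... with a JSON list of matches."""

    def do_GET(self):
        # parse_qs already URL-decodes the values.
        query = parse_qs(urlparse(self.path).query)
        text = query.get("text", [""])[0]
        pattern = query.get("pattern", [""])[0]
        try:
            status, payload = 200, {"matches": find_matches(text, pattern)}
        except re.error as exc:
            # An invalid regular expression is reported as a client error.
            status, payload = 400, {"error": str(exc)}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the service on localhost until interrupted (placeholder port)."""
    ThreadingHTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

After starting it with `serve()`, OpenRefine's "Add column by fetching URLs" could call it with an expression along the lines of `"http://127.0.0.1:8000/?pattern=%5Cd%7B4%7D&text=" + escape(value, "url")`, and the JSON in the new column can be read with `value.parseJson().matches`. Passing the text in the query string only suits short cell values; for longer input a POST endpoint would probably be the better choice.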

Here is a small gist showing how to perform Named Entity Recognition with spaCy and OpenRefine using FastAPI:

https://gist.github.com/b2m/6e2697ce182548a98320e4b7b7b885b6

ner-service.py

from typing import List

import spacy
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(
    title="NER service based on spaCy",
    description="""

This file has been truncated.

requirements.txt

fastapi==0.67.0
pydantic==1.8.2
python-multipart==0.0.5
spacy==3.1.1
uvicorn==0.14.0

Also check my comment for a simplified version, that you could use as a starting template for a pdftotext service.
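For the PDF case, the function sitting behind such a service could look roughly like the following. This is a sketch under assumptions, not the code from the linked comment: the names `fetch_bytes`, `extract_text` and `pdf_url_to_text` are made up for illustration, and it assumes the `pdftotext` Python package (which in turn needs poppler) is installed. The import is done lazily so that the rest of the module loads without it.

```python
import tempfile
import urllib.request
from typing import Callable


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the resource at `url` and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def extract_text(pdf_bytes: bytes) -> str:
    """Extract the text of all pages using the pdftotext package."""
    import pdftotext  # third-party, assumed to be installed

    # pdftotext.PDF expects a file object opened in binary mode,
    # so the downloaded bytes are written to a temporary file first.
    with tempfile.TemporaryFile() as f:
        f.write(pdf_bytes)
        f.seek(0)
        pdf = pdftotext.PDF(f)
        # Iterating over the PDF object yields one string per page.
        return "\n\n".join(pdf)


def pdf_url_to_text(
    url: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    extractor: Callable[[bytes], str] = extract_text,
) -> str:
    """Fetch a PDF from `url` and return its text.

    `fetch` and `extractor` are injectable so the function can be
    tried out without network access or a PDF library.
    """
    return extractor(fetch(url))
```

Exposed through an HTTP endpoint or a small CLI, `pdf_url_to_text` would return plain text to OpenRefine, where the regular expressions can then be applied as usual.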
