2025 Barcamp Session Proposal: Using the OpenRefine LLM Extension

… with Local and Remote Models

Description

LLMs are powerful tools for cleaning and enriching data, extracting entities, and generating translations. Thanks to @Sunil_Natraj, there is an excellent AI extension for OpenRefine that enables the use of local and remote LLM models with apps and services like Ollama, llama.cpp, and OpenRouter, as well as most other AI services based on the OpenAI API.

In this session, I will demonstrate how to install and set up the extension in OpenRefine. Following the demonstration, I would like to discuss use cases and applications for AI in the context of data wrangling.
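Under the hood, providers like Ollama and OpenRouter expose OpenAI-compatible chat-completion endpoints, so the extension's requests all share one shape. Here is a minimal sketch of such a request payload; the model name and prompt are illustrative assumptions, not the extension's actual defaults.

```python
import json

def build_chat_request(model, system_prompt, user_text):
    """Assemble an OpenAI-compatible chat-completion payload,
    as accepted by local servers such as Ollama or remote APIs."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        # deterministic output is usually preferable for data cleaning
        "temperature": 0,
    }

payload = build_chat_request(
    "llama3.2",  # hypothetical local model name
    "You are a data-cleaning assistant.",
    "Normalize this date: '3rd of March, 1999'",
)
print(json.dumps(payload, indent=2))
```

The same payload works against `http://localhost:11434/v1/chat/completions` (Ollama's default) or a hosted endpoint, which is why the extension can switch between local and remote models.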

Format

Presentation and Discussion
Duration: 30 minutes


This session was cancelled as @Michael_Markert was not available to present. Instead, @Martin did a small demo of how the extension works, and the group discussed the use of LLMs with OpenRefine. The notes from the shared Etherpad are available here.

Interesting. I missed the demo. I managed to install the extension, but is there good documentation available on the next step (setting up an LLM provider in OpenRefine)?

You can find some documentation directly in the GitHub repo. I suggest starting with those pages:

Cleaned-up notes from the pad

Participants discussed possible uses of the OpenRefine LLM extension and shared ideas on how it could support data wrangling workflows.

Potential use cases

Several categories of use cases were highlighted based on the extension documentation.

Content transformation

  • summarization
  • translation
  • style conversion
  • format standardization

These could be useful when preparing text datasets or normalizing descriptions across records.
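A per-cell transformation of this kind boils down to a well-scoped prompt applied to each value. The sketch below shows one plausible way to phrase such prompts; the exact wording is an assumption, not what the extension itself sends.

```python
def transformation_prompt(task, value):
    """Build a single-cell transformation instruction for an LLM.
    Asking for 'only the result' keeps replies usable as cell values."""
    tasks = {
        "summarize": "Summarize the following text in one sentence",
        "translate": "Translate the following text into English",
        "standardize": "Rewrite the following date in ISO 8601 format",
    }
    return f"{tasks[task]}. Return only the result, no explanation:\n{value}"

prompt = transformation_prompt("translate", "Guten Morgen")
```

Instructing the model to return only the result, with no commentary, matters here: any extra prose would end up inside the transformed cell.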

Information extraction

  • entity recognition
  • key fact extraction
  • timeline creation
  • relationship mapping

Participants noted that LLMs could help extract structured information from unstructured text fields.
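One practical pattern for this is to ask the model for JSON and parse it defensively, so malformed replies flag a row for review instead of breaking the transform. The prompt wording and the simulated reply below are illustrative assumptions.

```python
import json

# Hypothetical extraction prompt; in practice the reply would come from
# whichever provider the extension is configured to use.
EXTRACTION_PROMPT = (
    "Extract the person, organization, and date from the text below. "
    'Respond with JSON only, e.g. {"person": "...", "organization": "...", "date": "..."}.'
)

def parse_entities(model_response):
    """Parse the model's JSON reply, tolerating surrounding whitespace.
    Returns None on invalid JSON so the row can be reviewed manually."""
    try:
        return json.loads(model_response.strip())
    except json.JSONDecodeError:
        return None

# Simulated model reply for illustration:
simulated_reply = '{"person": "Ada Lovelace", "organization": "Royal Society", "date": "1843"}'
entities = parse_entities(simulated_reply)
```

Returning `None` rather than raising keeps a batch run over thousands of cells from failing on one bad reply.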

Content analysis

  • sentiment analysis
  • theme identification
  • category classification

These approaches may help classify or analyze textual datasets before further cleaning or enrichment.
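For classification tasks, constraining the model to a fixed label set and validating the reply keeps the resulting column clean. A minimal sketch, with an assumed label vocabulary:

```python
# Illustrative label set; a real workflow would use the project's own categories.
LABELS = ["positive", "neutral", "negative"]

def classification_prompt(value):
    """Ask the model to pick exactly one label from a closed vocabulary."""
    return (
        f"Classify the sentiment of the following text as one of "
        f"{', '.join(LABELS)}. Answer with the label only:\n{value}"
    )

def validate_label(model_response):
    """Normalize the reply; return None for out-of-vocabulary answers
    so they can be flagged rather than silently stored."""
    label = model_response.strip().lower()
    return label if label in LABELS else None
```

The validation step matters because models occasionally answer with synonyms or full sentences even when told not to.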

Multimodality

The possibility of multimodal workflows was also mentioned.

For example, the extension could potentially:

  • analyze images and return structured descriptions
  • extract information from images
  • interpret textual descriptions of images

Participants suggested that combining this with controlled vocabularies or predefined datasets could help constrain outputs and make results easier to integrate into OpenRefine workflows.
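In OpenAI-compatible APIs, multimodal requests embed the image as a base64 data URL inside the message content. A sketch of that message shape, with a few placeholder bytes standing in for a real image file:

```python
import base64

def build_image_message(image_bytes, question, mime="image/png"):
    """Build a user message pairing a text question with an inline image,
    following the OpenAI-style multimodal content format."""
    data_url = "data:{};base64,{}".format(
        mime, base64.b64encode(image_bytes).decode("ascii")
    )
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

# Placeholder bytes for illustration; a real call would read an image file.
msg = build_image_message(b"\x89PNG...", "Describe this image as structured JSON.")
```

Combining such a request with a constrained output format (as above) is one way to get image descriptions that slot directly into OpenRefine columns.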

Model Context Protocol (MCP)

Another topic raised was the potential relationship between the extension and the Model Context Protocol (MCP).

Supporting MCP could allow OpenRefine to interact with other tools or agents in a larger AI workflow, with external systems guiding or orchestrating OpenRefine tasks.

A related discussion is available here: Should we develop a MCP Server for OpenRefine?
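Since MCP messages are JSON-RPC 2.0, an external agent calling a hypothetical OpenRefine MCP server might send a `tools/call` request like the one below. The tool name and arguments are invented for illustration; no such server exists yet.

```python
def mcp_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 tools/call request as defined by the
    Model Context Protocol."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# "apply_operations" is a hypothetical tool name for this sketch.
request = mcp_tool_call(1, "apply_operations", {"project": "demo", "operations": []})
```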

Data sharing and privacy considerations

Participants also emphasized the importance of considering what data is shared with LLM services.

When using hosted models through public APIs, standard privacy considerations apply:

  • understand what data is being sent to external services
  • verify whether organizational policies restrict sending certain datasets to external APIs
  • consider using local or self-hosted models if working with sensitive data

Choosing between local models and external services depends on the data being processed and institutional policies.
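One lightweight mitigation, when a remote service must be used, is to redact obvious identifiers from cell values before they leave the machine. The patterns below are illustrative only; real compliance needs more than a couple of regexes.

```python
import re

# Illustrative patterns for common identifiers; not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact(text):
    """Mask e-mail addresses and phone-like numbers before an API call."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

clean = redact("Contact jane.doe@example.org or +49 30 1234567.")
```

For genuinely sensitive datasets, though, a local model (e.g. via Ollama) avoids the problem entirely rather than patching over it.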