2025 Barcamp Session Proposal: Using the OpenRefine LLM Extension

Participants discussed possible uses of the OpenRefine LLM extension and shared ideas on how it could support data wrangling workflows.

Potential use cases

Several categories of use cases were highlighted, drawing on the extension's documentation.

Content transformation

  • summarization
  • translation
  • style conversion
  • format standardization

These could be useful when preparing text datasets or normalizing descriptions across records.
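Transformations like these are typically driven by a per-cell prompt. As a minimal sketch (the template wording and function name are illustrative, not part of the extension's actual configuration), a format-standardization prompt could be built like this:

```python
# Sketch: building a per-cell prompt for format standardization.
# The prompt template and function name are illustrative assumptions,
# not the extension's actual API.

def build_standardization_prompt(value: str, target_format: str) -> str:
    """Wrap a cell value in an instruction asking the model to
    rewrite it in a fixed target format, with no extra commentary."""
    return (
        f"Rewrite the following value as {target_format}. "
        "Return only the rewritten value, with no explanation.\n\n"
        f"Value: {value}"
    )

prompt = build_standardization_prompt(
    "1st of March, 2024", "an ISO 8601 date (YYYY-MM-DD)"
)
print(prompt)
```

Asking the model to return only the rewritten value keeps the response easy to drop back into a cell without post-processing.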

Information extraction

  • entity recognition
  • key fact extraction
  • timeline creation
  • relationship mapping

Participants noted that LLMs could help extract structured information from unstructured text fields.
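One practical pattern is to prompt the model for JSON and then flatten its reply into values suitable for new columns. The response shape below ("entities" with "text"/"type" keys) is an assumed prompt contract, not a fixed API:

```python
import json

# Sketch: turning an LLM's JSON reply into grouped fields suitable for
# new OpenRefine columns. The "entities"/"text"/"type" structure is an
# assumed prompt contract, not part of the extension.

def parse_entities(llm_reply: str) -> dict:
    """Group extracted entities by type; tolerate malformed replies
    by returning an empty dict instead of raising."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return {}
    grouped = {}
    for ent in data.get("entities", []):
        grouped.setdefault(ent["type"], []).append(ent["text"])
    return grouped

reply = ('{"entities": [{"text": "Ada Lovelace", "type": "PERSON"},'
         ' {"text": "London", "type": "PLACE"}]}')
print(parse_entities(reply))
# {'PERSON': ['Ada Lovelace'], 'PLACE': ['London']}
```

Tolerating malformed replies matters here, since models do not always return valid JSON.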

Content analysis

  • sentiment analysis
  • theme identification
  • category classification

These approaches may help classify or analyze textual datasets before further cleaning or enrichment.
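For classification to be usable in facets, the model's free-text answers usually need to be snapped onto a closed label set. A minimal sketch (the labels and synonym table are illustrative):

```python
# Sketch: normalizing free-text sentiment answers onto a closed label
# set so downstream OpenRefine facets stay clean. Labels and the
# synonym table are illustrative assumptions.

LABELS = {"positive", "negative", "neutral"}
SYNONYMS = {"pos": "positive", "neg": "negative", "mixed": "neutral"}

def normalize_label(raw: str) -> str:
    """Lower-case, strip trailing punctuation, map known synonyms,
    and fall back to 'unclassified' for anything off-list."""
    cleaned = raw.strip().strip(".!").lower()
    cleaned = SYNONYMS.get(cleaned, cleaned)
    return cleaned if cleaned in LABELS else "unclassified"

print(normalize_label("Positive."))    # positive
print(normalize_label("somewhat ok"))  # unclassified
```

The explicit "unclassified" fallback makes off-list answers easy to facet on and review by hand.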

Multimodality

The possibility of multimodal workflows was also mentioned.

For example, the extension could potentially:

  • analyze images and return structured descriptions
  • extract information from images
  • interpret textual descriptions of images

Participants suggested that combining this with controlled vocabularies or predefined datasets could help constrain outputs and make results easier to integrate into OpenRefine workflows.
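One way to apply that constraint is to fuzzy-match the model's free-text label against the vocabulary and keep only near matches. A sketch using the standard library's `difflib` (the vocabulary terms are illustrative):

```python
from difflib import get_close_matches

# Sketch: constraining a model's free-text image label to the nearest
# term in a controlled vocabulary. The vocabulary is an illustrative
# example, not a real authority list.

VOCAB = ["photograph", "engraving", "lithograph", "manuscript", "map"]

def to_vocab_term(llm_label, cutoff=0.6):
    """Return the closest vocabulary term, or None if nothing
    is similar enough to trust."""
    matches = get_close_matches(llm_label.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(to_vocab_term("Photo graph"))  # photograph
print(to_vocab_term("zzz"))          # None
```

Returning None rather than a forced match leaves genuinely ambiguous outputs flagged for human review, which fits OpenRefine's review-oriented workflow.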

Model Context Protocol (MCP)

Another topic raised was the potential relationship between the extension and the Model Context Protocol (MCP).

Supporting MCP could allow OpenRefine to interact with other tools or agents in a larger AI workflow, potentially allowing external systems to guide or orchestrate OpenRefine tasks.
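MCP describes each tool an external agent may call with a name, a description, and a JSON Schema for its input. Purely as a sketch, an OpenRefine operation exposed this way might be described as follows (the tool name and parameters are hypothetical, not an existing OpenRefine API):

```python
import json

# Sketch: how an OpenRefine operation might be described as an MCP
# tool. MCP tools carry a name, a description, and a JSON Schema for
# their input; this particular tool and its parameters are hypothetical.

refine_tool = {
    "name": "apply_text_transform",  # hypothetical tool name
    "description": "Apply a GREL expression to a column in an OpenRefine project.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "project_id": {"type": "string"},
            "column": {"type": "string"},
            "expression": {"type": "string"},
        },
        "required": ["project_id", "column", "expression"],
    },
}

print(json.dumps(refine_tool, indent=2))
```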

A related discussion is available here: "Should we develop a MCP Server for OpenRefine?"

Data sharing and privacy considerations

Participants also emphasized the importance of considering what data is shared with LLM services.

When using hosted models through public APIs, standard privacy considerations apply:

  • understand what data is being sent to external services
  • verify whether organizational policies restrict sending certain datasets to external APIs
  • consider using local or self-hosted models if working with sensitive data

Choosing between local models and external services depends on the data being processed and institutional policies.
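That choice can be made explicit in configuration, for example by routing sensitive datasets to a local endpoint only. A minimal sketch (the endpoint URLs and sensitivity flag are illustrative; real deployments would follow institutional policy):

```python
# Sketch: routing requests to a local model for sensitive data and to a
# hosted API otherwise. The endpoint URLs and the sensitivity flag are
# illustrative assumptions.

LOCAL_ENDPOINT = "http://localhost:11434/v1"    # e.g. a self-hosted server
HOSTED_ENDPOINT = "https://api.example.com/v1"  # hypothetical hosted API

def pick_endpoint(dataset_is_sensitive):
    """Sensitive datasets never leave the local machine."""
    return LOCAL_ENDPOINT if dataset_is_sensitive else HOSTED_ENDPOINT

print(pick_endpoint(True))   # http://localhost:11434/v1
print(pick_endpoint(False))  # https://api.example.com/v1
```

Making the routing rule a single function keeps the policy auditable in one place rather than scattered across per-column settings.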