You are right; there will be no tight coupling with any platform / service.
Side note: you can definitely see in your demo how poor some of the LLM models still are at data extraction. The researcher is actually François Peronnet
and has no publicly available vital record information. Llama 3+ is no better than Llama 2 here.
However, when it comes to entity extraction and this kind of data extraction, my research has shown that the search engine companies' LLMs are decently better, e.g. Bing Copilot (Deep Search) and Google's Gemini: in a case like this one they return full names in the result instead of abbreviations. Gemini, Grok 2/3, Llama, Anthropic, and the others give decent info about the researchers' work in their results, but fail when further context is given (JSON containing vital information for the extracted entities). All of them, though, do a good job of refusing to give personal details (date of birth/death) in a response beyond name, academic bio, and citation info.
Still, it's a good demo of the flow; the info you're requesting is pushing the boundaries of what the non-search-engine LLMs are capable of (Grok is a small exception).
It's taking shape, great work Sunil!
One important point I'd like to come back to is the output constraint. For feature extraction, I'd say that the proposed framework is now outdated. When we call the service, it's common to constrain the output format by selecting a "structured output" parameter, "json_schema", and passing the schema to enforce in the call.
In the demo, if you analyze the result of the request to the model, it's easy to see that the model returns a part of the response that matches the requested structure, but this part is preceded and followed by unwanted comments, rendering the overall response invalid. What's more, repeating the same query shows occasional variability even in the format of the JSON that comes back, despite the schema you send it.
Parameters constraining the output format have been introduced to overcome these problems. You can read about them on this page:
https://platform.openai.com/docs/guides/structured-outputs
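To make this concrete, here is a minimal sketch in Python of such a call against the chat completions endpoint, using the plain HTTP API rather than any SDK; the model name, prompt, and schema fields are illustrative, not what the extension currently does.

import json, os, requests

# Illustrative schema: the response must be an object with exactly these fields.
schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "affiliation": {"type": "string"},
    },
    "required": ["full_name", "affiliation"],
    "additionalProperties": False,
}

body = {
    "model": "gpt-4o-mini",  # assumed; any model that supports structured outputs
    "messages": [{"role": "user", "content": "Extract the researcher's name and affiliation from: ..."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "researcher", "strict": True, "schema": schema},
    },
}

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=body,
    timeout=60,
)

# The message content should now be JSON matching the schema,
# with no surrounding commentary.
print(json.loads(resp.json()["choices"][0]["message"]["content"]))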
Thanks for pointing that out, I didn't know of this feature (which seems to be integrated in most LLM frameworks like Ollama and LM Studio now)!
Thanks @archilecteur. When using structured output, the user will be required to provide a valid JSON schema; will users be able to generate the schema?
I did a quick test and updated the code to use structured output. I was able to test this with OpenAI; Hugging Face was giving an access error.
Updated flow demo AI-Extension
For the generation of a valid JSON schema, the shortest route for the user who is not too familiar with this format is to ask a language model. A service such as JSONLint (https://jsonlint.com) can then validate the schema.
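For example, a small schema along these lines (the field names are purely illustrative) is enough to get started; the printed JSON can be pasted into the schema field, and JSONLint will confirm it is well formed.

import json

# Illustrative only: a minimal JSON Schema for extracting researcher details.
schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "research_field": {"type": "string"},
        "notable_publications": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["full_name"],
    "additionalProperties": False,
}

print(json.dumps(schema, indent=2))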
I'm afraid I can't help with the Hugging Face error message. I seem to remember encountering an error message myself when I wanted to use format constraints by calling a Hugging Face serverless inference endpoint. I think the free serverless inference endpoints don't allow advanced requests, but that's a guess, I haven't dug into it.
Can you also review the updated flow?
@ostephens @Martin @antonin_d @Michael_Markert @thadguidry @archilecteur @Ryan_E_Johnson
Demo of the AI extension in OpenRefine
I'm no UI expert, so I'll leave that to the authorities on the discussion list.
A few questions and observations in bulk:
- Where do we enter the account API key for external services?
- The title "LLM provider" seems misleading. Shouldn't we simply use "Model" instead?
- How do we go about using a locally loaded model? Simply add a choice of "Ollama/model" or "LM Studio/model"? Then we'd have to offer the option of specifying the model. Perhaps decouple the Provider from the Model?
- Should we have more control over parameters? Temperature (creativity), Top-P (variability), Seed (reproducibility), number of tokens (length... and cost)? On the one hand, this promises to complicate the configuration window, and I personally like the idea of keeping it simple as it is, not intimidating. On the other hand, these parameters have a decisive influence on results. Perhaps add an optional free parameter field, similar to the "JSON schema" field, as sketched below; those who know how to use it will do so.
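To illustrate that last point, an optional field could accept something like the following (hypothetical key names that mirror common OpenAI-style request parameters; not part of the extension today):

# Hypothetical contents of an optional "advanced parameters" field.
advanced_params = {
    "temperature": 0.2,  # creativity: lower = more deterministic extraction
    "top_p": 0.9,        # variability: nucleus sampling cutoff
    "seed": 42,          # reproducibility, where the provider supports it
    "max_tokens": 1000,  # response length, and therefore cost
}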
There is a management UI for LLM providers / models. It allows you to define local models as well. Here is a quick demo.
@ostephens @Martin @antonin_d @Michael_Markert @thadguidry @archilecteur @Ryan_E_Johnson
The LLM extension is available for testing. I have attached the extension archive to this post. Please try it out and share your feedback.
openrefine-llm-extension-0.1.0.zip (158.9 KB)
Dear @Sunil_Natraj,
thanks for your effort, it already looks great! I tried it but cannot save, although everything is filled out (see screenshot, version 3.8.7). It would be great if the API key field could be optional, as local services (I am using Ollama) do not require a key. The service is running at localhost; I checked it using curl.
Best
Michael
Hi Michael, can you set a dummy value and complete the flow? I will update the flow to make the API key optional.
Hi, thanks for trying out the extension. The missing LLM provider entry could be due to a data / code issue. Can you check the browser console for any errors and share them?
Would you be so kind as to share the values in your form? I still get an error, now "LLM request failed. Message : null",
with my local services (Ollama + LM Studio).
@Michael_Markert I tried this configuration, related to the result above:
{
"label" : "Ollama",
"apiURL" : "http://localhost:11434/v1/chat/completions",
"modelName" : "llama3.1:8b",
"temperature" : 1.0,
"maxTokens" : 5000,
"apiKey" : "1234"
}
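If it helps to narrow things down, here is a rough Python sketch (outside the extension) that sends a minimal request to the same local Ollama endpoint with the values above; if this returns a normal completion while the extension still reports "LLM request failed. Message : null", the problem is more likely on the extension side.

import requests

# Minimal chat request against Ollama's OpenAI-compatible endpoint;
# the API key is a dummy value, Ollama does not check it.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    headers={"Authorization": "Bearer 1234"},
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
    timeout=60,
)

print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])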