Thank you for the reply. It works just like you said. Great! And it also works with subfolders, as long as you follow the admin path. This is useful for long prompts, prompt versioning, and sharing prompts across multiple environments.
I did a bit of trial and error on a related topic: using Hugging Face models deployed in Google Colab via an API tunneled through a local host (with Ollama and ngrok), but I could not replicate that success with LM Studio.
I wonder if by toying around in the Cursor platform, we could come up with an extension for OpenRefine.
All, I made an attempt to model the flow and recorded a video of it; let me know your thoughts/opinions.
Generating a column using an LLM is a separate menu option.
Users can select from a list of LLMs which they have defined in the application; a separate option captures the LLM-related details (LLM, model, API endpoint, API key, ...). A possible shape for this is sketched just below this list.
Users can input what needs to be done along with sample data; together these make up the system prompt.
An option to preview is provided; this will call the LLM using the column data from row 1.
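For illustration only, the stored LLM details could look something like this (the field names are placeholders I made up, not part of the design; the endpoint shown is the Hugging Face chat-completions URL, used here purely as an example):

{
  "name": "Mistral 7B Instruct (Hugging Face)",
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "endpoint": "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3/v1/chat/completions",
  "apiKey": "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}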
I have only thought of the end user flow so far. I believe we can model this as an extension but I have not explored it yet. Do share your inputs on this point.
Well look at that. That’s a great start, thanks @Sunil_Natraj!
To extend your panel, it would need options to configure the parameters of the call (max tokens, temperature, top-p, etc., without forgetting the newer parameters that are just starting to be widely implemented, starting with seed). I'd suggest checking out what the LM Studio developers have done in this regard.
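For illustration, an OpenAI-style chat-completion body carrying those knobs might look like this (parameter names and support vary by provider, and seed in particular is not implemented everywhere):

{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "What is the capital of France?"}],
  "max_tokens": 500,
  "temperature": 0.2,
  "top_p": 0.9,
  "seed": 42
}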
Among other things, pay attention to the use of structured output with a JSON schema. Here, we'd get something similar to export formatting in OpenRefine using the Templating option, but returned in a cell.
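As a rough sketch, using the OpenAI-style response_format field added to the same request body (the schema below is invented just to show the shape, a single string value to drop into the cell):

"response_format": {
  "type": "json_schema",
  "json_schema": {
    "name": "cell_value",
    "schema": {
      "type": "object",
      "properties": { "value": { "type": "string" } },
      "required": ["value"]
    }
  }
}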
And for this to work at scale, in OpenRefine of all places (with the possibility of thousands of rows), the calls must be parallelized.
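Inside OpenRefine this would presumably be a thread pool with configurable concurrency and some rate limiting, but just to illustrate the idea, here is a crude shell sketch that assumes one JSON payload file per row in a payloads/ directory and fires up to 8 requests at once:

ls payloads/*.json | xargs -P 8 -I {} \
  curl -s "$ENDPOINT" \
    -H "Authorization: Bearer $API_KEY" \
    -H 'Content-Type: application/json' \
    --data @{} -o {}.out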
Don’t forget the API key that needs to be put somewhere.
(For this to be in line with Michael's original intention, it would be appropriate to offer the choice of calling up a model run locally.)
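For example, LM Studio and Ollama both expose an OpenAI-compatible endpoint on localhost (ports 1234 and 11434 by default, if I remember correctly), so a local call is just a different base URL, no API key, and whatever model name is loaded locally (the one below is made up):

curl 'http://localhost:1234/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  --data '{
    "model": "mistral-7b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'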
Again, that is a great start, and super encouraging.
I would strongly advocate prioritizing open source models, as OpenRefine is part of the open source community, starting with Llama, Mistral, DeepSeek... You'll find a list of what's popular right now on Hugging Face, but it changes from month to month.
Also, for one, I would be interested in using BERT-based encoder models for efficient and speedy named entity recognition in OpenRefine.
Prioritizing open source models will avoid the prohibitive costs you mentioned.
Thanks for the effort, @Sunil_Natraj! I am not sure if one should start with specific models, as the endpoints all seem to use the same "OpenAI" call format, like:
curl 'https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3/v1/chat/completions' \
  -H 'Authorization: Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' \
  -H 'Content-Type: application/json' \
  --data '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 500,
    "stream": false
  }'
So one might register their model in a similar way to a reconciliation service.
One obstacle that comes to mind is that regular chat boxes hide model parameters (much as a reconciliation service hides its query parameters) that OpenRefine users might want to edit directly when sending the call, such as temperature or max_completion_tokens, as these can have a high impact on response quality. And since the parameters may differ between models, I am not sure how to model a universal front end.
One could think about having a form field for every element of the --data part of the POST message, so users decide which parameters they want to be able to edit when registering a new LLM service. Then, when users switch models in the dropdown, the form changes as well. But I think that might be hard to implement.
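Purely as a sketch of that idea, a registration record could declare which body parameters are user-editable along with their defaults (all field names here are hypothetical):

{
  "name": "HF Mistral 7B Instruct",
  "endpoint": "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3/v1/chat/completions",
  "apiKey": "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "editableParameters": {
    "temperature": 0.7,
    "top_p": 1.0,
    "max_tokens": 500
  }
}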
The basic structure for chat completion is the same; the subtle differences are in how parameters are passed. For example, the model is part of the URL for Hugging Face, while Claude takes it in the request body. Every LLM platform also has additional parameters.
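For comparison, the Anthropic call (to the best of my knowledge) uses a /v1/messages path, an x-api-key header plus an anthropic-version header, and puts the model and max_tokens in the body:

curl 'https://api.anthropic.com/v1/messages' \
  -H 'x-api-key: sk-ant-xxxxxxxx' \
  -H 'anthropic-version: 2023-06-01' \
  -H 'Content-Type: application/json' \
  --data '{
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 500,
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'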
A couple of comments @Sunil_Natraj, one minor and one more substantial:
Minor: The information in "Provide a sample response" was not what I was expecting; instead, this turned out to be a direction to format the response in a specific way. I think the use of this field needs to be clearer.
More substantial: I think it would be better to use something like an existing OpenRefine layout for "Add column", where you get to see the current data and preview the results. Being able to see the current data in the cell is really important for judging that the correct thing is being done. The two existing designs are the standard OpenRefine preview window and the "Add columns from reconciled values" dialog.
For me the usual OpenRefine window would bring more consistency, but of course here the "preview" might be a problem, as the LLM would need to be called to generate the preview.
The thing I don't like about "Add columns from reconciled values" is that you can't see the current data in the view. However, the ability to add multiple columns in a single operation, and the option to configure each column (which might be where you could specify the various user inputs), could be really nice.
Anyway, I think the most important thing for me is: please make it easy for the user to see the existing data in the cell while they are doing this configuration. This means (IMO) including some of this data in the configuration window (the data grid is hidden and inactive while the pop-over is present, meaning you can't navigate it to see the data easily). This makes it so much easier to write and amend the description of what is needed, and also to see in a preview whether the correct data has been extracted in a set of sample cases.
I would also love to see support for local models, and, in my case, easy ways to plug into e.g. Azure-hosted LLM services. In general this would just mean supporting any API endpoint and key, not just OpenAI's, and I like the idea of a two-step process. This would allow non-hacky ways to set up the LLM connection, and then a separate unified interface to do the actual LLM calls. Great work!
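For reference, the Azure OpenAI flavour looks something like the following, if I remember correctly (the resource name, deployment name and api-version below are placeholders; the deployment replaces the model field, and the key goes in an api-key header):

curl "https://YOUR_RESOURCE.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT/chat/completions?api-version=2024-02-01" \
  -H 'api-key: xxxxxxxx' \
  -H 'Content-Type: application/json' \
  --data '{
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 500
  }'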
Thanks Ryan. The proposed design will allow adding any LLM service as long as it supports REST and uses the standard chatCompletion request/response model.
@Sunil_Natraj, after reflection, I think it may be simpler to interface with frameworks like LM Studio or gpt4all rather than trying to maintain direct integration with the LLM vendor or solution.