Using local ChatGPT-like LLMs in OpenRefine for data wrangling

I just made a tutorial on how to use LLMs on your own computer to wrangle, clean, and enrich data in an OpenRefine project using the API of LM Studio or a similar application. It is in German but there are English subtitles.

Michael, you made my day.
Thank you very much!

You are welcome, glad you like it!

Hello Michael,

Any chance you know how to adapt the code so that I could call the Hugging Face API?

https://medium.com/@researchgraph/how-to-use-hugging-face-api-2942ea9da32a

I’ve been hoping to do it for some time now.

I have no clue of this API, but this might be helpful:
Python3 uses requests for sending and receiving messages as shown in the example you provided on the website, but the Python implementation in OpenRefine is Python2 so one has to use the package urllib instead of requests. Also you have to replace the "message" you want to send with variable "value" to take the content from your OpenRefine column and to change the encoding of the message so that OpenRefine can read it. The last thing is to "return content" to OpenRefine from the script – which replaces print(content).

The differences are very small. Here is the "normal" Python script for my use case:

import requests

# Local LM Studio server (OpenAI-compatible chat completions endpoint)
url = "http://localhost:1234/v1/chat/completions"

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

data = {
    "messages": [
        {
            # The instruction for the model (the "system" role carries the task description)
            "content": "Generate a basic JSON containing only the following information on the person mentioned: dateofbirth, placeofbirth, dateofdeath, placeofdeath. Do not provide further information.",
            "role": "system"
        },
        {
            # The text to be analyzed
            "content": "Franz Kafka (* 3. Juli 1883 in Prag, Österreich-Ungarn; † 3. Juni 1924 in Klosterneuburg-Kierling, Österreich) jüdischer, deutschsprachiger Schriftsteller; 1902 machte er die Bekanntschaft des Schriftstellers Max Brod, mit dem ihn eine lebenslange Freundschaft verband. Im Juli 1912 reiste Franz Kafka mit Max Brod nach Weimar, wo sie im Hotel Chemnitius logierten. Bei einem Besuch des Goethehauses verliebte er sich in Margarethe Kirchner, die sechzehnjährige Tochter des Hausmeisters.",
            "role": "user"
        }
    ],
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "stream": False,
    "max_tokens": 2048,
    "stop": [
        "hello"
    ],
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "temperature": 0.7,
    "top_p": 0.95
}

# requests serializes the dictionary to JSON itself via the json= parameter
response = requests.post(url, headers=headers, json=data)
print(response.text)

and here is the OpenRefine version:

import json
import urllib2  # OpenRefine's Jython is Python 2, so urllib2 replaces requests

url = "http://localhost:1234/v1/chat/completions"

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

data = {
    "messages": [
        {
            # The instruction for the model
            "content": "Generate a basic JSON containing only the following information on the person mentioned: dateofbirth, placeofbirth, dateofdeath, placeofdeath. Do not provide further information.",
            "role": "system"
        },
        {
            "content": value,  # "value" is the cell content of the current OpenRefine column
            "role": "user"
        }
    ],
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "stream": False,
    "max_tokens": 2048,
    "stop": [
        "hello"
    ],
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "temperature": 0.7,
    "top_p": 0.95
}

data_string = json.dumps(data)             # convert dictionary to JSON string
data_bytes = data_string.encode('utf-8')   # encode to bytes

req = urllib2.Request(url, data=data_bytes, headers=headers)

response = urllib2.urlopen(req)
response_bytes = response.read()
response_json = json.loads(response_bytes.decode('utf-8'))
content = response_json["choices"][0]["message"]["content"]
content = content.replace("<|end_of_turn|>", "")  # strip the model's end-of-turn marker
return content  # "return" hands the result back to the OpenRefine cell
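
Untested, but if the Hugging Face Inference API works as described in the article you linked (a bearer token and a JSON "inputs" field), the same urllib2 pattern might look roughly like this. MODEL_ID and the token are placeholders, and the response shape varies by task:

import json
import urllib2

# Hypothetical sketch for the Hugging Face Inference API, not tested!
url = "https://api-inference.huggingface.co/models/MODEL_ID"

headers = {
    'Authorization': 'Bearer YOUR_HF_TOKEN',  # placeholder token
    'Content-Type': 'application/json'
}

# The basic Inference API expects an "inputs" field; "value" is the OpenRefine cell content
data = {"inputs": value}

req = urllib2.Request(url, data=json.dumps(data).encode('utf-8'), headers=headers)
response = urllib2.urlopen(req)
result = json.loads(response.read().decode('utf-8'))

# Text-generation models typically return a list of objects with "generated_text"
return result[0]["generated_text"]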

Thank you for your reply, Michael.

I struggled to make it work. However, I've adapted your code a bit and have run OpenAI models, even o1-mini and o1-preview. It would still be neat to have it working with Hugging Face.

Next is to try the solution you provided with LM Studio.

Do you know Hugging Sheets?

https://huggingface.co/spaces/JournalistsonHF/huggingface-on-sheets

Such an integration in OpenRefine would give less code-inclined users access to using LLMs in combination with it.

Hi @archilecteur, I would be more interested in the direct use cases (the questions that you are thinking of asking any LLM about your datasets).

Can you provide a few of the kinds of questions about your data that would be useful?

Generally, can you give some example snippets of your columns' data and the kinds of questions you're thinking about? That way, instead of a generic and very wide LLM solution (not always fast or optimal), we, the community, might have a better sense of what kinds of extensions could be developed for OpenRefine to help your use cases.

Hugging Sheets is new to me, thanks for the tip! The local solution is especially suitable for anyone not wanting their data on an external server because of security concerns, data privacy (medical information and the like), or because the use of such web services is not allowed at an institution.

Hi @thadguidry, thanks for the reply on the subject!

I've been thinking about your question all day. Since language models are great at manipulating text, there's no shortage of examples. Generally speaking, for my use, at least for the problems I'd like to solve with the help of language models in OpenRefine, it is especially the cases where high-level regularities cover irregularities of detail that regular expressions (even elaborate ones) fail to capture perfectly. Add to this the fact that I often work with tens of thousands of rows, and I find myself running out of time to check all the instances of these irregularities.

Let me share a sample project with you. Here, the original unsegmented record is in the first column. There are twenty-four records in this sample, and I have to deal with 63,000 similar records (which are already the result of a great deal of upstream work).

You'll notice that the record generally begins with a list of authors. The number of authors may vary. Their names are most often given in the format “initial followed by surname”, but there may be some inversions, and the first name may be given in full. Names are often capitalized, but there are many instances where this is not the case. Once in every twenty entries, there is a hyphenation in a name.

Next, research affiliations (usually academic, but not always) are mentioned. Sometimes in full (research group, research center, department, faculty, university), sometimes simply the name of the parent institution. Often there are several: departments, universities. Sometimes the names of affiliations are inserted between the authors' names. Sometimes affiliations are signaled by asterisks, numbers or letters next to the names (originally in superscript, but that was lost), and are spelled out in full at the bottom of the record. Sometimes the names of institutions are followed by their addresses, with postal codes. A final punctuation mark usually ends the mention of authors and their affiliations, but not always.

The title follows, without any particular indication. Sometimes the title contains a period. It usually ends with a period, but there are exceptions. Sometimes the title contains a chemical formula mixing letters, numbers, unusual characters, parentheses, brackets, braces and punctuation marks.

Then comes the summary of the scientific paper.

Lastly, the abstract is sometimes followed by specific references, most often to research funds.

Affiliations, titles and abstracts contain several hyphenations. Particularly in the first part, many commas have been mistaken for periods by the character recognition step. Sometimes letters have been misrecognized too ("Clsude" instead of "Claude"). And then there are the multiple line breaks wrongly introduced, especially in the case of chemical formulas.
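
For illustration, here is the kind of system message one might try in the Jython script above. Just a sketch: the field names are invented, not a tested prompt.

# Hypothetical system prompt for segmenting such records (illustration only)
system_prompt = (
    "Segment the following bibliographic record into a JSON object with the keys "
    "authors, affiliations, title, abstract and funding. Repair obvious OCR errors "
    "(commas read as periods, stray line breaks, broken hyphenation), but do not "
    "add any information that is not in the record."
)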

Right. So you often have to deal with:

  1. very long text with semi-high variability of record patterns and separator elements...

definitely the realm of LLMs. For instance, I prompted Bing CoPilot to convert just one of your source row cells (textual vomit :slight_smile: ) into a tabular record row, and it partitioned the text into tabular format nicely; however, the title and abstract were mangled and automatically translated into some wrong columns (CoPilot's assumptions about the separator patterns were wrong). Humans remain better at recognizing language patterns and boundaries: let's use AI as the worker, while the human remains the guide.

I then re-prompted it to ask for corrections.

I think we could probably create a new OpenRefine "AI Services" extension that would handle much of the hand-coded parts of the example above in an easier-to-work-with dialog form:

  1. An AI services listing, updated and maintained by the community. Hugging Face, for example, is already a good candidate for this.
  2. The extension should allow custom HTTP headers (to account for the various AI API service endpoints and their lifecycle changes).
  3. Allow editing and deleting API keys, maybe locally encrypted into the OpenRefine preferences array.
  4. A prompt text input box that allows template variables with GREL column syntax, e.g. "convert the numbers in {{ columns.amount }} to British Pounds" (see the sketch after this list).
  5. Previewing of data manipulations (like our expression editor gives us, but allowing fullscreen temporary grid viewing and undo/redo during a prompt conversation).
  6. What else do we have as ideas here?
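
To make item 4 concrete, here is a toy sketch of how such template substitution might work; the {{ columns.NAME }} syntax is just the hypothetical placeholder format from the example above:

import re

def render_prompt(template, row):
    # Replace {{ columns.NAME }} placeholders with the row's cell values.
    def repl(match):
        return str(row.get(match.group(1), ""))
    return re.sub(r"\{\{\s*columns\.(\w+)\s*\}\}", repl, template)

# Example: fills in the "amount" cell of a row
print(render_prompt("convert the numbers in {{ columns.amount }} to British Pounds",
                    {"amount": "42.50"}))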

@thadguidry that is a great idea!

That would obviously be great.

One issue that could be alleviated by UI design is the need to batch queries. Each row is a query, and if all the queries are sent in one job, as in the add-column solution, there is a real chance that the job, once started, will exhaust the user's patience: OpenRefine could easily appear stuck for hours. One workaround is to add a column that groups rows into fixed-size batches and to index the records by it (see the sketch below). Instead of leaving this to the user, OpenRefine could take care of it: a column could be automatically added at the beginning of the table, even a temporary one, and the user could be asked to set the number of rows to send at a time. Something along those lines.
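
For example, such a batch-index column could be created with a one-line GREL expression (a sketch, with a hypothetical batch size of 50 rows):

floor(row.index / 50)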

Yes, the bad UX for long-running operations is a problem that applies in many other cases (reconciliation, evaluation of complex Jython expressions…). I was hoping that the incremental updating of the grid that I implemented in the new architecture I have been working on would solve it, but unfortunately there isn't any prospect of it getting released soon.

@antonin_d Why not soon-ish? I mean, by, let's say, next summer?

About this UX redesign, it would be great if there were a solution for all OpenRefine operations that risk running for an extended time, that's for sure.

Now you’ll have me looking for this update in each new release!

It's a feature that requires breaking changes in the architecture (or at least I don't see how to implement it satisfactorily in the existing architecture).
I don't think we can expect those changes to happen at all, as there isn't consensus about them in the dev team. So it's not just a matter of waiting a few more months.

I’ve been using this LLM integration with Python since @Michael_Markert posted his video.

For the more savvy Python users out there, how do you replace the hardcoded prompt with the contents of a local text file?

I've tried defining a prompt_path and reading the file with:

with open(prompt_path, 'r') as f:
    prompt = f.read()

But that's as far as I get for the moment. And it doesn't work.
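
A minimal sketch of how that idea could fit into the earlier Jython script (untested; prompt_path is a placeholder and should be an absolute path, since relative paths resolve against OpenRefine's working directory, not your project):

import json
import urllib2
import io

url = "http://localhost:1234/v1/chat/completions"
headers = {'accept': 'application/json', 'Content-Type': 'application/json'}

# Placeholder: replace with the absolute path to your prompt file
prompt_path = "/path/to/prompt.txt"

# io.open allows explicit UTF-8 decoding on Python 2 / Jython
with io.open(prompt_path, 'r', encoding='utf-8') as f:
    prompt = f.read()

data = {
    "messages": [
        {"content": prompt, "role": "system"},  # prompt read from the file
        {"content": value, "role": "user"}      # cell content from the column
    ],
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "stream": False,
    "max_tokens": 2048,
    "temperature": 0.7
}

req = urllib2.Request(url, data=json.dumps(data).encode('utf-8'), headers=headers)
response = urllib2.urlopen(req)
response_json = json.loads(response.read().decode('utf-8'))
return response_json["choices"][0]["message"]["content"]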