You are right; there will be no tight coupling with any platform / service.
Side note: you can definitely see in your demo how poor some of the LLM models still are at data extraction. The researcher is actually François Peronnet
and has no publicly available vital record information. Llama 3+ is no better than Llama 2 here.
However, when it comes to entity extraction and this kind of data extraction, my research has shown that the search engine companies' LLMs are decently better, e.g. Bing Copilot (Deep Search) and Google's Gemini: in a case like this one they return full names in the result instead of abbreviations. Gemini, Grok 2/3, Llama, Anthropic, and the others give decent info about the researchers' work in their results, but fail when further context is given (JSON containing vital information for the extracted entities). All of them, though, do a good job of refusing to give personal details (date of birth/death) in a response beyond name, academic bio, and citation info.
Still, it's a good demo of the flow; the info you're requesting is pushing the boundaries of what the non-search-engine LLMs are capable of (Grok is a small exception).
It's taking shape, great work Sunil!
One important point I'd like to come back to is the output constraint. For feature extraction, I'd say that the proposed framework is now outdated. When we call the service, it's common to constrain the output format by selecting a "structured output" parameter, "json_schema", and passing the schema to enforce in the call.
In the demo, if you analyze the result of the request to the model, it's easy to see that the model returns a part of the response that matches the requested structure, but this part is preceded and followed by unwanted comments, rendering the overall response invalid. What's more, repeating the same query shows occasional variability even in the format of the JSON that comes back, despite the schema you send it.
Parameters constraining the output format have been introduced to overcome these problems. You can read about them on this page:
https://platform.openai.com/docs/guides/structured-outputs
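To make this concrete, here is a minimal sketch in Python of such a call against the chat completions endpoint, using the plain HTTP API rather than any SDK; the model name, prompt, and schema fields are illustrative, not what the extension currently does.

import json, os, requests

# Illustrative schema: the response must be an object with exactly these fields.
schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "affiliation": {"type": "string"},
    },
    "required": ["full_name", "affiliation"],
    "additionalProperties": False,
}

body = {
    "model": "gpt-4o-mini",  # assumed; any model that supports structured outputs
    "messages": [{"role": "user", "content": "Extract the researcher's name and affiliation from: ..."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "researcher", "strict": True, "schema": schema},
    },
}

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=body,
    timeout=60,
)

# The message content should now be JSON matching the schema,
# with no surrounding commentary.
print(json.loads(resp.json()["choices"][0]["message"]["content"]))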
Thanks for pointing that out, I didn't know of this feature (which seems to be integrated in most LLM frameworks like Ollama and LM Studio now)!
Thanks @archilecteur. When using structured output, the user will be required to provide a valid JSON schema; will users be able to generate the schema?
I did a quick test and updated the code to use structured output. I was able to test this with OpenAI; Hugging Face was giving an access error.
Updated flow demo AI-Extension
For the generation of a valid JSON schema, the shortest route for the user who is not too familiar with this format is to ask a language model. A service such as JSONLint (https://jsonlint.com) can then validate the schema.
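For example, a small schema along these lines (the field names are purely illustrative) is enough to get started; the printed JSON can be pasted into the schema field, and JSONLint will confirm it is well formed.

import json

# Illustrative only: a minimal JSON Schema for extracting researcher details.
schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "research_field": {"type": "string"},
        "notable_publications": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["full_name"],
    "additionalProperties": False,
}

print(json.dumps(schema, indent=2))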
I'm afraid I can't help with the Hugging Face error message. I seem to remember encountering an error message myself when I wanted to use format constraints by calling a Hugging Face serverless inference endpoint. I think the free serverless inference endpoints don't allow advanced requests, but that's a guess, I haven't dug into it.
Can you also review the updated flow?
@ostephens @Martin @antonin_d @Michael_Markert @thadguidry @archilecteur @Ryan_E_Johnson
Demo of the AI extension in OpenRefine
I'm no UI expert, so I'll leave that to the authorities on the discussion list.
A few questions and observations in bulk:
- Where do we enter the account API key for external services?
- The title "LLM provider" seems misleading. Shouldn't we simply use "Model" instead?
- How do we go about using a locally loaded model? Simply add a choice of "Ollama/model" or "LM Studio/model"? Then we'd have to offer the option of specifying the model. Perhaps decouple the Provider from the Model?
- Should we have more control over parameters? Temperature (creativity), Top-P (variability), Seed (reproducibility), number of tokens (length... and cost)? On the one hand, this promises to complicate the configuration window, and I personally like the idea of keeping it simple as it is, not intimidating. On the other hand, these parameters have a decisive influence on results. Perhaps add an optional free parameter field, similar to the "JSON schema" field, as sketched below; those who know how to use it will do so.
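To illustrate that last point, an optional field could accept something like the following (hypothetical key names that mirror common OpenAI-style request parameters; not part of the extension today):

# Hypothetical contents of an optional "advanced parameters" field.
advanced_params = {
    "temperature": 0.2,  # creativity: lower = more deterministic extraction
    "top_p": 0.9,        # variability: nucleus sampling cutoff
    "seed": 42,          # reproducibility, where the provider supports it
    "max_tokens": 1000,  # response length, and therefore cost
}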
There is a management UI for LLM providers / models. It allows you to define local models as well. Here is a quick demo.
@ostephens @Martin @antonin_d @Michael_Markert @thadguidry @archilecteur @Ryan_E_Johnson
The LLM extension is available for testing. I have attached the extension archive to this post. Please try it out and share your feedback.
openrefine-llm-extension-0.1.0.zip (158.9 KB)
Dear @Sunil_Natraj,
thanks for your effort, it already looks great! I tried it but cannot save, although everything is filled out (see screenshot, version 3.8.7). It would be great if the API key field could be optional, as local services (I am using Ollama) do not require a key. The service is running at localhost; I checked it using curl.
Best
Michael
Hi Michael, can you set a dummy value and complete the flow? I will update the flow to make the API key optional.
Hi, thanks for trying out the extension. The missing LLM provider entry could be due to a data / code issue. Can you check the browser console for any errors and share them?
Would you be so kind as to share the values in your form? I still get an error, now "LLM request failed. Message : null",
with my local services (Ollama + LM Studio).
@Michael_Markert I tried this configuration, related to the result above:
{
"label" : "Ollama",
"apiURL" : "http://localhost:11434/v1/chat/completions",
"modelName" : "llama3.1:8b",
"temperature" : 1.0,
"maxTokens" : 5000,
"apiKey" : "1234"
}
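If it helps to narrow things down, here is a rough Python sketch (outside the extension) that sends a minimal request to the same local Ollama endpoint with the values above; if this returns a normal completion while the extension still reports "LLM request failed. Message : null", the problem is more likely on the extension side.

import requests

# Minimal chat request against Ollama's OpenAI-compatible endpoint;
# the API key is a dummy value, Ollama does not check it.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    headers={"Authorization": "Bearer 1234"},
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Say hello"}],
    },
    timeout=60,
)

print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])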