OpenRefine support for Lexemes in Wikidata: how would you use this?

At this moment, OpenRefine doesn’t yet support creating or editing Wikidata Lexemes (Lexicographical data). This has been a long-term request from the Wikidata community though: Support Wikidata lexemes · Issue #2240 · OpenRefine/OpenRefine · GitHub

There’s no prospective developer for this yet. If a candidate comes forward, though, I am sure that the current OpenRefine maintainers will do their best to onboard and help them.

Would you like to see this feature as a user? If that’s the case, you will help OpenRefine and prospective developers very much if you can describe (in a comment below, or directly in the GitHub issue):

  • What kind of edits and imports you would want to do with this feature (creating new lexemes? adding statements to senses? and so on)?
  • What would your initial dataset look like for that?

If you have an inclination for design, we would welcome your description or proposal (such as a visual mock-up, or textual description) of what lexeme editing would look like in OpenRefine, in particular given their nested structure (the ability to have forms and senses inside lexemes). Just describe the user experience you would expect as precisely as you can. What would it look like to create a new lexeme with three forms and four senses via OpenRefine? To add a sense to a lexeme? To add a statement to an existing sense in a lexeme? And so on.

Such descriptions will help prospective developers wrap their heads around the issue!

At this moment, OpenRefine doesn’t yet support creating or editing Wikidata Lexemes (Lexicographical data). This has been a long-term request from the Wikidata community though: Support Wikidata lexemes · Issue #2240 · OpenRefine/OpenRefine · GitHub

To put "long-term" in historical context, this was requested less than 3 years ago. Here's a list of almost 400 enhancement requests, most not Wikidata-specific, stretching back over 12 years. If any users would like to see some of these implemented, please give them a thumbs up so that they bubble to the top of this sorted list.

I'd love to see things implemented which will benefit all OpenRefine users, like better date handling or improved localization support.

Tom

Can we keep this thread on topic, and not silence a valid user request?

A general discussion about prioritization and focus for OpenRefine, both for the short term and the longer term, is a very useful one to have, and I would encourage the project leadership and user community to work on that together. I also very much welcome any OpenRefine user group to speak up and voice their own needs and priorities to feed that exercise.

I imagine that the need will come when someone finds a freely licensed dictionary. Now, importing lexemes would be a bit different, since you probably want to reconcile against both existing lemmas and any senses that they already have. The easy part is when the lemma doesn’t exist at all: a new lexeme can then be created from scratch. But if the lemma exists and has one or more senses, those need to be checked as well in order not to create duplicates. And I think reconciling against a sense on an item, rather than against the entire item, would be a bit novel, right?

As I said on GitHub, there are a lot of "basic" things that would be useful.

Lexemes are complex (especially because of the three-level model: lexeme, sense, form), but being able to edit at least the lexeme (= main) level would already be great.
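To make the three levels concrete, here is a minimal Python sketch of how a lexeme with nested forms and senses could be represented, with an edit that only touches the lexeme level. The class names, field names, and the `P_DICT_ID` property are hypothetical and only for illustration; this is not OpenRefine's or Wikibase's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    id: str                      # form IDs look like "L525-F1"
    representation: str
    grammatical_features: List[str] = field(default_factory=list)


@dataclass
class Sense:
    id: str                      # sense IDs look like "L525-S1"
    gloss: str


@dataclass
class Lexeme:
    id: str
    lemma: str
    language: str
    lexical_category: str
    # Lexeme-level statements: property -> list of values (simplified).
    statements: Dict[str, List[str]] = field(default_factory=dict)
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


def add_identifier(lexeme: Lexeme, prop: str, value: str) -> Lexeme:
    """Lexeme-level edit only: forms and senses are left untouched."""
    values = lexeme.statements.setdefault(prop, [])
    if value not in values:      # avoid adding the same identifier twice
        values.append(value)
    return lexeme
```

A first version of the feature could stop at edits like `add_identifier`, leaving the nested `forms` and `senses` lists read-only.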

For a concrete example, the input data could be an external database (like an online dictionary) whose IDs could be added to lexemes as identifiers (which would improve the quality and reliability of lexeme entities, which is much needed right now).
The dataset would typically be a string (the main lemma), an ID (from the URL) and some other details (language, lexical category, etc.: all sorts of clues to make sure it's the same lexeme).

lemma, id, info
maison, A1M0031, feminine noun

Then, we would need:

  • a reconciliation service to find the corresponding Lexeme (L525 in this case)
  • to be able to edit Wikidata to add the id onto this corresponding Lexeme
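As a rough sketch of the lookup step, the following Python reads the sample rows above and builds a SPARQL query that could be sent to the Wikidata Query Service to find candidate lexemes. The function names and the `info`-to-category mapping are my own assumptions; `Q150` (French) and `Q1084` (noun) are the Wikidata items I would expect to use here. Nothing is sent over the network; the code only builds the query text.

```python
import csv
import io

SAMPLE = """lemma, id, info
maison, A1M0031, feminine noun
"""

LANGUAGE_ITEM = "Q150"                 # French (assumed language of the dataset)
CATEGORY_ITEMS = {"noun": "Q1084"}     # assumed mapping from the "info" column


def read_rows(text):
    """Parse the CSV, trimming the spaces that follow each comma."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [{k.strip(): (v or "").strip() for k, v in row.items()} for row in reader]


def category_for(info):
    """Pick the first known lexical category mentioned in the info column."""
    for word in info.split():
        if word in CATEGORY_ITEMS:
            return CATEGORY_ITEMS[word]
    raise ValueError("no known lexical category in: " + info)


def lexeme_query(lemma, lang_code, language_item, category_item):
    """Build a SPARQL query matching lemma, language and lexical category."""
    escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?lexeme WHERE {\n"
        f'  ?lexeme wikibase:lemma "{escaped}"@{lang_code} ;\n'
        f"          dct:language wd:{language_item} ;\n"
        f"          wikibase:lexicalCategory wd:{category_item} .\n"
        "}"
    )


rows = read_rows(SAMPLE)
query = lexeme_query(rows[0]["lemma"], "fr", LANGUAGE_ITEM, category_for(rows[0]["info"]))
```

The second step (adding the ID to the matched lexeme) is the part that needs editing support and is not covered by this sketch.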

In the long term, it would be ideal to be able to do the same (reconcile and edit) at the sense and form levels, but that would require more work (to make a good interface). For the shorter term, and as a first step, I think we could focus on the lexeme level.

True.
Creating a lexeme is already a bit difficult (the tool needs to understand the three-level model), so I would suggest just editing the lexeme level of existing lexemes as a first, easier step.
Reconciliation may seem more difficult, but I guess it’s closer to what OpenRefine already does (so “kind of” easier), and it’s a more fundamental building block that we will need later anyway.

Hi! I made several bot imports on Wikidata lexemes. There are two use cases that could have been done (at least partially) with OpenRefine if it supported lexicographical data.

  1. Importing lexemes (by creating new ones and updating existing ones). This has been done for Breton lexemes using a Python bot (documentation). With OpenRefine, everything on a lexeme should be editable (lemmas, statements, senses, forms with their grammatical features, etc.).
  2. Matching lexemes, using lemma, lexical category, and statements, like P5185. This has been done to match existing French lexemes to several online dictionaries (example), again with a Python bot. In OpenRefine, highlighting multiple identical matches (like tour and tour) would be great (but certainly not a requirement).
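The matching described in the second use case could be sketched as follows: index candidate lexemes by lemma and lexical category, use grammatical gender (the kind of value P5185 holds) to narrow down ties, and flag what is still ambiguous so it can be highlighted. The lexeme IDs below are placeholders, not real Wikidata IDs, and the function names are my own.

```python
from collections import defaultdict

# Placeholder candidates; the IDs are made up for the example.
CANDIDATES = [
    {"id": "L-A", "lemma": "tour", "category": "noun", "gender": "feminine"},
    {"id": "L-B", "lemma": "tour", "category": "noun", "gender": "masculine"},
    {"id": "L-C", "lemma": "maison", "category": "noun", "gender": "feminine"},
]


def build_index(candidates):
    """Group candidate lexemes by (lemma, lexical category)."""
    index = defaultdict(list)
    for candidate in candidates:
        index[(candidate["lemma"], candidate["category"])].append(candidate)
    return index


def match(row, index):
    """Return (status, ids); status is 'matched', 'ambiguous' or 'none'."""
    hits = index.get((row["lemma"], row["category"]), [])
    # Use gender as an extra clue only when the first pass is not unique.
    if len(hits) > 1 and row.get("gender"):
        narrowed = [h for h in hits if h["gender"] == row["gender"]]
        if narrowed:
            hits = narrowed
    if not hits:
        status = "none"
    elif len(hits) == 1:
        status = "matched"
    else:
        status = "ambiguous"
    return status, [h["id"] for h in hits]


index = build_index(CANDIDATES)
```

Rows that come back as `ambiguous` are the ones a user interface would need to highlight for manual review.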

These are only some use cases. There are probably many others, like matching senses or forms.

Hi y’all,

Any updates on this?

By the way, while testing something else, I noticed that the function “Reconcile → Use values as identifiers” (see Reconciling with unique identifiers in the official doc) kind of, but not really, understands that a value is a Lexeme.
You need to explicitly enter the full Lexeme:L123; the value L123 (without the namespace as prefix) or L:L123 (with the shortcut for the namespace) is not understood.
This is already good news, as we can technically have Lexemes in a Wikidata Schema (which doesn’t work afterwards, failing with an "Invalid entity identifiers" error, but it is a very small first step!).
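If a column contains a mix of these spellings, it could be normalized before reconciling. A small Python sketch, assuming the three spellings above are the only ones to handle (the function name is hypothetical):

```python
import re

# Accepts "L123", "L:L123" or "Lexeme:L123" and captures the bare ID.
_LEXEME_ID = re.compile(r"^(?:Lexeme:|L:)?(L[1-9]\d*)$")


def to_prefixed_lexeme_id(value):
    """Return the 'Lexeme:L123' spelling for any of the accepted inputs."""
    match = _LEXEME_ID.match(value.strip())
    if match is None:
        raise ValueError("not a lexeme identifier: " + repr(value))
    return "Lexeme:" + match.group(1)
```

Anything that is not a lexeme ID (an item ID such as Q42, for example) raises a `ValueError` instead of being silently rewritten.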

Hi!

To answer the open question of how I would use Lexeme support in OpenRefine: Wikimedia Portugal is currently working with a partner to import lexemes in a variety of Portuguese-based creoles. I really enjoyed @envlh's post on Breton, and I will probably try to "mimic" the process to an extent, but since I am far from being an experienced Python coder, a tool with a graphical interface would be a huge help, particularly for making sure that already existing datasets are reconciled correctly.

Any updates on this?

Is there anything that could be done to help this?

Hi @Nicolas_VIGNERON,

At the moment, we have an ongoing project to improve the Wikimedia Commons integration in OpenRefine, to address the most urgent issues that have been reported there. This is part of a WMF-funded project focused on the train-the-trainers program. This development work will be done by Wikimedia Sweden (we have been discussing that with them, I guess we could do a formal announcement soon too).

In itself, that's unrelated to lexemes, but the reason I mention it is that it's an attempt to partner with Wikimedia organizations to help them build up in-house capacity to work on Wikimedia integration in OpenRefine. In the long term, I hope this helps us maintain and develop such integrations :)

My hope would be that this partnership works out well for the Commons integration and that people see it as a viable way forward. A similar arrangement could be used for developing a lexeme integration (be it by Wikimedia Sweden, another chapter, or another organization entirely).

So, one way you could help would be to identify who those partners could be and to look for potential funders. At the moment, it seems difficult to get funding from WMF for such work, but there might be other ways to make this happen.

You might ask: why not do that in-house, within the OpenRefine project directly? That would certainly be possible too, but at the moment we have neither funding for it nor volunteers who would like to work on it. Also, we have developed quite a bit of Wikimedia integration in OpenRefine over the past years (first for Wikidata, then generalizing to Wikibase, then adding support for Wikimedia Commons) and we need to be realistic about our capacity: each new integration we develop naturally creates an expectation of maintenance and support, which we need to be able to meet in the long term. At the same time, there are a lot of other aspects of OpenRefine that also require our attention. So personally, I have tried to shift my focus away from Wikimedia integration towards improvements which benefit all OpenRefine users (including the Wikimedia community), for instance the reproducibility improvements at the moment.

Despite this shift, I am really supportive of more work being done on Wikimedia integration, and I will be working closely with Wikimedia Sweden to make this partnership a success. I would be similarly happy to support anyone else wanting to work in this space :)

I have started a list of extension requests on the wiki and added Wikidata lexemes there.

Note that it's not clear to me if it can/should be a separate extension or built into the Wikibase extension: it's an open question that I listed there.

Is there anything new about this?

Wikidata Lexemes are currently in a phase where we are adding a lot of identifiers, and OpenRefine would be very useful for this task.

We talked a bit about it with some people (including @abbe98, @Andre_Costa and @Alicia_Fagerving) during the last Wikimania.