OpenRefine support for Lexemes in Wikidata: how would you use this?

At this moment, OpenRefine doesn’t yet support creating or editing Wikidata Lexemes (Lexicographical data). This has been a long-term request from the Wikidata community though: Support Wikidata lexemes · Issue #2240 · OpenRefine/OpenRefine · GitHub

There’s no prospective developer for this yet. If someone steps up as a candidate to develop it, though, I am sure the current OpenRefine maintainers will do their best to onboard and support them.

Would you like to see this feature as a user? If so, you will help OpenRefine and prospective developers a lot if you can describe (in a comment below, or directly in the GitHub issue):

  • What kinds of edits and imports would you want to do with this feature (creating new lexemes? adding statements to senses? and so on)?
  • What would your initial dataset look like for that?

If you have an inclination for design, we would also welcome a proposal (a visual mock-up or a textual description) of what lexeme editing could look like in OpenRefine, in particular given the nested structure of lexemes (forms and senses living inside them). Just describe the user experience you would expect as precisely as you can. What would it look like to create a new lexeme with three forms and four senses via OpenRefine? To add a sense to an existing lexeme? To add a statement to an existing sense? And so on.

Such descriptions will help prospective developers wrap their heads around the issue!


> At this moment, OpenRefine doesn’t yet support creating or editing Wikidata Lexemes (Lexicographical data). This has been a long-term request from the Wikidata community though: Support Wikidata lexemes · Issue #2240 · OpenRefine/OpenRefine · GitHub

To put "long-term" in historical context, this was requested less than 3 years ago. Here's a list of almost 400 enhancement requests, most not Wikidata-specific, stretching back over 12 years. If any users would like to see some of these implemented, please give them a thumbs up so that they bubble to the top of this sorted list.

I'd love to see things implemented which will benefit all OpenRefine users, like better date handling or improved localization support.

Tom

Can we keep this thread on topic, and not silence a valid user request?

A general discussion about prioritization and focus for OpenRefine, both for the short term and the longer term, is a very useful one to have, and I would encourage the project leadership and user community to work on it together. I also very much welcome any OpenRefine user group speaking up and voicing their own needs and priorities to feed that exercise.


I imagine the need will come when someone finds a freely licensed dictionary. Importing lexemes would be a bit different, since you probably want to reconcile both against existing lemmas and against any senses they already have. The easy part is when the lemma doesn’t exist at all: a new one can be created from scratch. But if the lemma exists and has one or more senses, those need to be checked as well in order not to create duplicates. And I think reconciling against a sense within an entity, rather than against the entire entity, would be a bit novel, right?
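To make the sense-checking part more concrete, here is a minimal sketch (plain Python, outside OpenRefine) of what "look at the senses a lexeme already has" means in practice, using the standard wbgetentities API. L525 (maison) is only used as an example, and the helper name is mine.

```python
# Rough sketch of the duplicate check described above: fetch an existing
# lexeme and list its senses, so an import can decide whether a sense
# from the dictionary already exists. L525 is used only as an example.
import requests

API = "https://www.wikidata.org/w/api.php"

def get_senses(lexeme_id):
    """Return the senses of a lexeme as (sense id, glosses) pairs."""
    params = {"action": "wbgetentities", "ids": lexeme_id, "format": "json"}
    entity = requests.get(API, params=params, timeout=30).json()["entities"][lexeme_id]
    return [
        (sense["id"], {lang: g["value"] for lang, g in sense.get("glosses", {}).items()})
        for sense in entity.get("senses", [])
    ]

for sense_id, glosses in get_senses("L525"):
    # Comparing these glosses against the incoming dictionary definitions
    # is the "reconcile against a sense, not the whole entity" step.
    print(sense_id, glosses)
```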

As I said on GitHub, there are a lot of "basic" things that would be useful.

Lexemes are complex (especially because of the three-level model: lexeme, sense, form), but being able to edit at least the lexeme (= main) level would already be great.

For a concrete example, the input data could be an external database (like an online dictionary) that could be used to add an identifier to the lexeme (which would improve the quality and reliability of lexeme entities, something that is much needed right now).
The dataset would typically be a string (the main lemma), an ID (from the URL), and some other details (like language, lexical category, etc.: all sorts of clues to make sure it's the same lexeme). For example:

lemma, id, info
maison, A1M0031, feminine noun

Then, we would need:

  • a reconciliation service to find the corresponding Lexeme (L525 in this case)
  • to be able to edit Wikidata to add the ID to this corresponding Lexeme

In the long term, it would be ideal to be able to do the same (reconcile and edit) at the sense and form levels, but that would require more work (to build a good interface). As a shorter-term first step, I think we could focus on the lexeme level. A rough sketch of what this reconcile-and-edit step looks like outside OpenRefine follows below.
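For what it's worth, here is a sketch of the first of the two steps listed above for the maison example: finding candidate lexemes for a lemma via the public wbsearchentities API. This is plain Python rather than anything OpenRefine does today; the write step (adding the dictionary ID) is only hinted at in a comment, since the property for this hypothetical dictionary does not exist.

```python
# Minimal sketch of the reconciliation step above, using the public
# Wikidata API directly (what an OpenRefine reconciliation service for
# lexemes would do behind the scenes). Values come from the example row.
import requests

API = "https://www.wikidata.org/w/api.php"

def search_lexemes(lemma, language_code):
    """Search Wikidata for lexemes whose lemma matches the given string."""
    params = {
        "action": "wbsearchentities",
        "search": lemma,
        "language": language_code,
        "type": "lexeme",
        "format": "json",
    }
    response = requests.get(API, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("search", [])

row = {"lemma": "maison", "id": "A1M0031", "info": "feminine noun"}

for candidate in search_lexemes(row["lemma"], "fr"):
    # A real reconciliation service would also compare the lexical
    # category, language, and clues like "feminine noun" before matching.
    print(candidate["id"], candidate.get("description", ""))

# Once the match (L525 here) is confirmed, the second step would be a
# single edit adding the dictionary ID as an external-identifier
# statement on L525 (e.g. via wbcreateclaim) - exactly what a Wikidata
# schema with lexeme support could generate from the "id" column.
```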


True.
Creating a lexeme is already a bit difficult (the tool needs to understand the three-level model), so I would suggest just editing the lexeme level of existing lexemes as a first, easier step.
Reconciliation may seem more difficult, but I guess it's closer to what OpenRefine already does (so "kind of" easier), and it's a more essential/fundamental brick that we will need later anyway.

Hi! I have made several bot imports of Wikidata lexemes. There are two use cases that could have been done (at least partially) with OpenRefine if it supported lexicographical data.

  1. Importing lexemes (by creating new ones and updating existing ones). This has been done for Breton lexemes using a Python bot (documentation). With OpenRefine, everything on a lexeme should be editable (lemmas, statements, senses, forms with their grammatical features, etc.).
  2. Matching lexemes, using lemma, lexical category, and statements like P5185 (grammatical gender). This has been done to match existing French lexemes to several online dictionaries (example), again with a Python bot. In OpenRefine, highlighting multiple identical matches (like tour and tour) would be great (but certainly not a requirement). A sketch of this kind of query follows below.

These are only some use cases. There are probably many others, like matching senses or forms.
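As an illustration of use case 2, here is a sketch of the kind of lookup such matching relies on: finding French lexemes by lemma, lexical category, and a statement such as grammatical gender (P5185) through the Wikidata Query Service. The specific lemma and Q-ids are only an example, and this is plain Python rather than anything OpenRefine exposes today.

```python
# Sketch of matching a dictionary entry to existing French lexemes using
# lemma + lexical category + a statement (P5185, grammatical gender),
# via the Wikidata Query Service. Q150 = French, Q1084 = noun,
# Q1775415 = feminine; the lemma "maison" is just an example.
import requests

QUERY = """
SELECT ?lexeme WHERE {
  ?lexeme dct:language wd:Q150 ;
          wikibase:lemma "maison"@fr ;
          wikibase:lexicalCategory wd:Q1084 ;
          wdt:P5185 wd:Q1775415 .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lexeme-matching-sketch/0.1 (forum example)"},
    timeout=60,
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    # More than one result here is the "multiple identical matches"
    # case mentioned above (homographs sharing the same lemma).
    print(binding["lexeme"]["value"])
```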


Hi y’all,

Any updates on this?

By the way, while testing something else, I noticed that the function “Reconcile → Use values as identifiers” (see Reconciling with unique identifiers in the official doc) kinda-but-not-really understands that a value is a Lexeme.
You need to explicitly enter the full Lexeme:L123; the values L123 (without the namespace prefix) and L:L123 (with the shortcut for the namespace) are not understood.
This is already good news, as we can technically have Lexemes in a Wikidata schema (which doesn’t work afterwards - an “Invalid entity identifiers” error - but a very small first step is there!).

Hi!

Answering the open question of how I would use support for Lexemes in OpenRefine: Wikimedia Portugal is currently working with a partner to import lexemes in a variety of Portuguese-based creoles. I really enjoyed @envlh's post on Breton, and I will probably try to "mimic" that process to an extent, but since I am far from being an experienced Python coder, using a tool with a graphical interface would be a huge help - particularly for verifying reconciliation against datasets that already exist.