Since 2022, with support from a Wikimedia grant, it has been possible to use OpenRefine to batch edit and upload files on Wikimedia Commons, with a focus on adding multilingual, linked, structured data to the files on Commons.
This new Wikimedia Commons functionality in OpenRefine is especially useful for cultural institutions that want to upload files to Commons with linked, structured data. OpenRefine offers powerful import functionality for various data formats (CSV, TSV, Excel sheets, XML…) and APIs (for those cultural institutions that use them). It also allows revisiting existing Wikimedia Commons files, improving their metadata, and adding multilingual structured data to them. Wikimedians in general can also use OpenRefine to batch upload their own or externally hosted files to Wikimedia Commons.
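To give an idea of the kind of tabular input such a workflow typically starts from, here is a minimal, hypothetical sketch of a metadata table an institution could prepare and then open in OpenRefine for a Commons batch upload. The column names and values below are illustrative examples only, not a format required by OpenRefine or Commons.

```python
# Hypothetical sketch: writing a small metadata table that could be imported into
# OpenRefine as the starting point of a Commons batch upload. Column names and
# values are illustrative examples only, not a prescribed format.
import csv

rows = [
    {
        "file_path": "scans/painting_001.tif",
        "title": "View of the harbour at dusk",
        "creator_wikidata_id": "Q12345",  # placeholder Wikidata item for the creator
        "license": "CC0",
        "description_en": "Oil painting of a harbour scene, late 19th century.",
        "description_nl": "Olieverfschilderij van een havengezicht, eind 19e eeuw.",
    },
]

with open("commons_upload_batch.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```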
For 2023-24, as part of its support for Wikimedia Commons, the Wikimedia Foundation is funding OpenRefine for bug fixes to its Commons features, a train-the-trainer program, documentation, and a WikiLearn course.
To learn more about the current users of the Commons features in OpenRefine, and about their needs, I ran two parallel user surveys in October 2023.
- A first, classic survey (fully anonymous), built with the open source tool LimeSurvey, asked users how they use the Commons features in OpenRefine, and how they learn (or want to learn) to use them better;
- A second, more experimental survey (also fully anonymous), built with the open source platform Allourideas, asked users to prioritize the features that matter most to them.
Both surveys were actively announced to current users via various widely read channels for GLAM/cultural heritage contributors to Wikimedia projects. In addition, 84 individuals who had actively used the features by early October 2023 were personally invited to participate on their Wikimedia Commons talk page (example).
Between these 84 directly approached users and the general announcements, 21 people fully completed the first, traditional (LimeSurvey-powered) survey, and the same number of people, 21 (not necessarily the same individuals…), spent some time with the prioritization (Allourideas) survey as well.
Survey 1: Profile of current users, how they use the Commons features, and how they prefer to learn
All edits to Wikimedia projects are visible and traceable as part of page histories on every wiki (example). Using this public data, I already did a rough scan of past usage of the Wikimedia Commons features earlier this year, which showed that the features are most actively used by cultural heritage institution staff, and by people affiliated with Wikimedia organizations (who usually directly support cultural heritage institutions). Only a small percentage of users can be identified as purely “general Wikimedia editors” without a cultural heritage interest or affiliation.
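For those curious how such a scan can be approached: below is a minimal sketch (not necessarily the exact method I used) of querying the public MediaWiki API on Commons for recent edits carrying an edit tag. The tag name "OpenRefine" is an assumption here and should be checked against Special:Tags on Commons.

```python
# Minimal sketch: listing recent Commons edits that carry a given edit tag via the
# public MediaWiki API. The tag name below is an assumption; check Special:Tags on
# Wikimedia Commons for the tag actually applied to OpenRefine edits.
import requests

API = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "list": "recentchanges",
    "rctag": "OpenRefine",          # assumed tag name
    "rcprop": "user|title|timestamp",
    "rclimit": 50,
    "format": "json",
}
response = requests.get(API, params=params, timeout=30)
response.raise_for_status()
for change in response.json()["query"]["recentchanges"]:
    print(change["timestamp"], change["user"], change["title"])
```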
In October, a traditional survey, built with LimeSurvey, asked respondents directly about their profiles, their usage of the Wikimedia Commons features, and how they learn (or would like to learn) about them.
The questions asked in the survey are available in this document. A full result summary (direct export from LimeSurvey) is available as a PDF file. The data collected in the survey is available in this spreadsheet - the survey was fully anonymous so this data does not contain any personal information.
Below I’ll include a selection of answers, mostly showing the associated graphs. Full results (including free text answers) are available in the PDF export.
Profile of users
Multiple answers were possible here. Most respondents (15 out of 21) identify as associated with a cultural organization or other partner. 14 identify as Wikimedians. As this survey was advertised on talk pages, there is probably a bit of bias towards active Wikimedians compared to the general overview of a few months ago.
Most users hail from Europe (mainly Northern, Western, Central, and Eastern Europe). There were unfortunately no respondents from the MENA region (Middle East and North Africa) or Sub-Saharan Africa.
Most respondents (13) edit Wikimedia projects in English, followed by French (6), German (3), and other options (Malayalam, Dutch (a duplicate of one of the given options), and Basque). It was possible to give multiple answers.
Usage of OpenRefine
71.43% of respondents (15 out of 21) had used OpenRefine before.
Most use a local version of OpenRefine. Only a few (4) respondents use the cloud version on Wikimedia PAWS.
80.95% of respondents (17 people) indicate that they have downloaded and use the Wikimedia Commons extension. Still, a few people don’t know about the extension or are not sure whether they use it. (To date, the most recent version of this extension has been downloaded 257 times according to GitHub’s statistics.)
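(As an aside, per-release download counts like this can be read from GitHub's public releases API; a hedged sketch follows, in which the repository path is an assumption on my part.)

```python
# Illustrative sketch: summing release-asset download counts via GitHub's releases
# API. The repository path "OpenRefine/CommonsExtension" is an assumption; replace
# it with the actual repository hosting the Wikimedia Commons extension.
import requests

url = "https://api.github.com/repos/OpenRefine/CommonsExtension/releases"
releases = requests.get(url, timeout=30).json()
for release in releases:
    downloads = sum(asset["download_count"] for asset in release.get("assets", []))
    print(release["tag_name"], downloads)
```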
Two thirds of the 15 people who answered this question use OpenRefine outside Wikimedia as well; one third (5 people) use it only for Wikimedia tasks.
Wikimedia Commons users of OpenRefine also use the software for other tasks, most importantly:
- Linking/reconciling with Wikidata (15 responses)
- Importing or editing data in Wikidata (14)
- General data cleaning and manipulation (13)
Familiarity with Wikimedia Commons, Wikidata, and OpenRefine in general
For brevity, I am mostly summarizing this section. Full details and graphs (where not mentioned) can be found in the PDF.
Previous familiarity with Wikimedia Commons
- Most respondents are moderately (23.81%) to very (57.14%) familiar with uploading to Wikimedia Commons in general, without OpenRefine. Only 4 respondents had never done this before they started doing so with OpenRefine.
- There is a similar pattern for editing existing files on Wikimedia Commons before engaging with OpenRefine: respondents are moderately (23.81%) or very (52.38%) familiar with it. Slightly more people (19.05%) are totally unfamiliar with it, though.
- Respondents feel slightly less confident and experienced about adding structured data to Commons files in general. 47.62% of respondents have done this but can learn a lot more, and only 19.05% indicate they are very familiar with the process. 7 respondents have never done it before, or only a few times.
Previous familiarity with Wikidata
- The vast majority of respondents (76.19%) had made manual Wikidata edits before starting with the Commons features in OpenRefine, but 3 people were fully unfamiliar with this.
- Somewhat fewer people (61.90%) were very familiar with batch tools for this task (e.g. QuickStatements and OpenRefine), and 5 people were fully unfamiliar with them.
Previous familiarity with Wikipedia and other Wikimedia projects
- 52.38% are very familiar with these, 28.38% moderately familiar. Only 3 respondents are unfamiliar.
Previous familiarity with OpenRefine itself, and other software
- Fewer respondents were familiar with using OpenRefine for general data cleaning:
- With regard to spreadsheet software: most respondents are intermediate (11 respondents) or advanced (9) users.
- The next question asked about respondents’ usage of Pattypan, a previously popular, simple, volunteer-maintained tool for guided batch uploads to Wikimedia Commons. For various reasons, Pattypan has not been consistently operational recently, which probably explains why several respondents (6 out of 21) state that they ‘don’t want to use this tool’.
- Answers are different for QuickStatements, a simple (also volunteer-built) batch editing framework for Wikidata, which most respondents seem to actively use and master.
- Opinions are split around scripting and programming languages like Python and R. Some respondents are not proficient at all or don’t want to use these, others are somewhat to very proficient.
- And in line with the profile of respondents: a majority are somewhat (9 people) or very (6) familiar with some kind of (cultural heritage) collections management software.
Learning and training preferences
How did respondents familiarize themselves with (the Wikimedia Commons features of) OpenRefine? How do they prefer to be onboarded?
Respondents learned to use OpenRefine in many different ways, the most popular being OpenRefine’s own documentation (18 out of 21 respondents), looking up specific tasks online (12), and following the instructions on Wikimedia Commons (11). Multiple answers could be given for this question.
Typical tasks and how achievable they are
The survey then asked respondents about the (Wikimedia Commons related and general OpenRefine) tasks in their Wikimedia Commons workflow, and how achievable they found them. Most tasks generally take “average effort”, but Wikidata-related tasks are perceived as more achievable than others. Editing wikitext on Wikimedia Commons is clearly seen as a more challenging task.
- Manipulating, cleaning and preparing data:
- Uploading data to Wikidata:
- Editing existing data on Wikidata:
- Reconciling data with Wikidata:
- Editing existing files on Wikimedia Commons:
- Uploading files to Wikimedia Commons:
- Reconciling file names with Wikimedia Commons:
- Adding or editing structured data on Wikimedia Commons:
- Adding or editing wikitext on Wikimedia Commons:
- Using schemas and schema templates:
The survey then had an open question asking respondents to describe whether they were able to finish the task they wanted to do, and whether they encountered barriers. Several comments mentioned that file reconciliation on Commons does not always work as intended/expected because certain characters are not recognized. Another recurring answer was a ‘not so intuitive / difficult / unclear UI’. All answers can be read in the PDF.
Confidence and plans to use OpenRefine again
- Confidence about one’s ability to use OpenRefine is generally moderate to high:
- Most respondents (20 out of 21) plan to use OpenRefine again, one respondent is unsure.
- When asked why (open question), several respondents state reasons that include ‘powerful’ and ‘easy’. See the PDF for the full list of answers.
- When asked how many files respondents plan to upload in the next year, answers vary widely:
- The next open question asked which kinds of OpenRefine tasks respondents would recommend. Answers vary widely and include many common OpenRefine tasks: data cleaning, reconciling, and batch uploading to Wikidata and Wikimedia Commons. See the PDF for the full list of answers.
- One respondent, who indicated they may not use OpenRefine anymore, mentions that they perceive the tool as “super clunky” and that “99% of my data cleaning / manipulation needs can be handled through simpler tools”.
- When asked whether they would be interested in setting up a repeat or automated workflow with OpenRefine, 10 out of 20 respondents say ‘maybe’.
Feature and learning needs
- Respondents were asked in an open question what they would need most to make better use of the Commons features in OpenRefine. Recurring answers include better support for wikitext, very short instruction videos for each separate task, a clearer UI, and various requests around help and documentation. See the PDF for the full list of answers.
- There are not many answers to the question “What do you wish you had learned before starting doing your edits/uploads?”, and each answer is different. One respondent asks about data modeling conventions (what goes on Wikidata, what goes in wikitext, etc.); another wanted more insight into how to work with Commons categories. See the PDF for the full list of answers.
- When asked how they prefer to learn to use OpenRefine, respondents’ answers are very diverse too, with a somewhat stronger preference towards reading documentation at their own pace.
- 14 out of 21 respondents say “yes” when asked if they are interested in following the planned OpenRefine-Commons WikiLearn course.
Survey 2: Prioritization of most-wanted improvements
For this survey, I used the open source initiative Allourideas, which I knew from previous projects (and which, from asking around, has been used by other open source software projects for community feature prioritization as well). It’s a platform that helps communities prioritize ideas and needs in an easy way, by repeatedly presenting each respondent with two (random) choices from a list (predefined and/or user-generated).
Screenshot of a typical choice prompt that survey participants were presented with.
The OpenRefine-Commons survey in October 2023 was pre-fed (“seeded”) with a list of features from the following sources (note that they are not only Wikimedia Commons-specific, but mostly consist of general, high-level wishes that may be relevant to all OpenRefine users):
- Already identified Wikimedia-Commons specific needs (see list here)
- High-level needs identified in OpenRefine’s own draft roadmap, which has been compiled by community members over the past years (see here)
- The list of general requests collected in the 2022 edition of OpenRefine’s biennial survey (see here).
The aggregated list of “seed suggestions” is available in this spreadsheet.
In addition, users were able to add suggestions of their own. In total, two users added three new suggestions, but they were either duplicates of existing ones (“upload files larger than 100MB”), very generic (“better reconciliation”), or a question (“Isn’t it already possible to upload directly from harddrive?”).
The Allourideas platform provides a few graphs to show engagement with a survey. We see that 21 users have participated, and together they have “clicked” (i.e. indicated a preference) 512 times.
Results of the voting process
The results of the voting are the following. An idea’s score reflects how often it was chosen in the pairwise comparisons it appeared in: scores above 50 can be considered “upvoted” (with higher scores indicating stronger preference), scores below 50 “downvoted” (they were passed over more often than chosen when presented as an option to the user).
Allourideas presents these results on a web page that is only visible to the survey admin (from which I copy-pasted the above overview). It also provides downloadable datasets; however, these look different and contain details about each individual click, which can be challenging to parse. Those interested in the nitty-gritty of all user behavior in the survey can find the raw survey data in various spreadsheets in this folder. This specific spreadsheet contains the data closest to the overview I pasted above.
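For anyone who wants to re-derive per-idea scores from that click-level data themselves, here is a naive sketch that simply computes the share of pairwise comparisons each idea won. Allourideas' own scoring is a statistical estimate rather than a raw win rate, and the (winner, loser) pairs below are hypothetical, so this is only an approximation of how the exported clicks relate to the scores above.

```python
# Naive sketch: turning pairwise "A chosen over B" clicks into a 0-100 score per
# idea (share of comparisons won). This approximates, but is not identical to,
# Allourideas' own scoring. The example pairs are hypothetical.
from collections import defaultdict

clicks = [  # (winner, loser) pairs, as could be extracted from the exported data
    ("better wikitext support", "improved documentation"),
    ("better wikitext support", "faster reconciliation"),
    ("faster reconciliation", "improved documentation"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for winner, loser in clicks:
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1

for idea in sorted(appearances, key=lambda i: wins[i] / appearances[i], reverse=True):
    print(f"{100 * wins[idea] / appearances[idea]:5.1f}  {idea}")
```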
A few Wikimedia Commons related observations
Wikimedia Commons recently reached the milestone of 100 million files on the platform. Of these 100 million, around 100,000 (or 0.1%) have now been uploaded with OpenRefine. External observers sometimes mention to me that working with Wikimedia Commons itself is challenging - which is true for complete outsiders - but nevertheless, enough people have managed to understand this process for the platform to reach 100 million files! The Wikimedia movement has organized and created various upload and support processes to make this possible: guided uploads via tools like the UploadWizard or the upcoming Flickypedia; a lively and generally helpful editor community; and, crucially for cultural and other partners of the Wikimedia movement, support in the form of e.g. national and regional affiliates with volunteers and sometimes staff who actively help and onboard partners. It seems that most existing Wikimedia Commons users of OpenRefine were already familiar with the Commons upload and editing process and use the tool as a new approach to a (sometimes already well-known) task. OpenRefine is probably not the best tool for the job for complete newcomers.
As I am working on a train-the-trainer program and preparing an online course on WikiLearn, I was very interested in the learning preferences of the respondents. It seems most are happy to follow documentation at their own pace. At the same time, I’m aware that the people who completed this survey are those who have already, on their own initiative, discovered the features and trained/educated themselves. I also interact a lot with users for whom this is challenging (both practically and because of a lower skill level), and I hear in these conversations that in-person training and guided online trainings are extremely important to bring people on board. To grow the number of Wikimedia Commons contributions through OpenRefine, the Wikimedia movement needs a large number of OpenRefine trainers and mentors.
With regard to feature prioritization, I am familiar with the Wikimedia Commons-related requests, and I am not very surprised by the Commons-specific needs that drift to the top. As for the very actively requested better support for wikitext: as discussed in another thread in this forum, the most broadly impactful approach, not just for OpenRefine but for any upload process to Wikimedia Commons, would (in my opinion) be to address this issue on the Wikimedia Commons side: make sure that uploaders can fully focus on structured data, and only need to use minimal wikitext. I am bringing together a working group to address this.