Results of two user surveys for Wikimedia Commons users of OpenRefine

Since 2022, with support from a Wikimedia grant, it is possible to use OpenRefine to batch edit and upload files on Wikimedia Commons, with a focus on adding multilingual, linked, structured data to the files on Commons.

This new Wikimedia Commons functionality in OpenRefine is especially useful for cultural institutions that want to upload files to Commons with linked, structured data. OpenRefine offers powerful import functionality for various data formats (csv, tsv, Excel sheets, XML…) and APIs (for those cultural institutions that use them). It also allows revisiting existing Wikimedia Commons files, improving their metadata, and adding multilingual structured data to them. Wikimedians in general can also use OpenRefine to batch upload their own or externally hosted files to Wikimedia Commons.

For 2023-24, as part of its support for Wikimedia Commons, the Wikimedia Foundation is funding OpenRefine for bug fixes to its Commons features, for a train-the-trainer program, documentation, and a WikiLearn course.

In order to learn more about the current users of the Commons features in OpenRefine, and their needs, I held two parallel user surveys in October 2023.

  1. A first, classical survey (fully anonymous), built with the open source tool LimeSurvey, asked users how they use the Commons features in OpenRefine, and how they (want to) learn to use them better;
  2. A second, more experimental survey (also fully anonymous), run on the open source platform Allourideas, asked users to prioritize the features that matter most to them.

Both surveys were actively announced to current users via various widely read channels for GLAM/cultural heritage contributors to Wikimedia projects. In addition, 84 individuals who had actively used the features by early October 2023 were personally invited to participate on their Wikimedia Commons talk page (example).

Through these personal invitations and the general announcements, 21 people fully completed the first, traditional (LimeSurvey-powered) survey, and the same number of people, 21 (not necessarily the same individuals…), spent some time with the prioritization (Allourideas) survey as well.

Survey 1: Profile of current users, how they use the Commons features, and how they prefer to learn

All edits to Wikimedia projects are visible and traceable as part of page histories on every wiki (example). Using this public data, earlier this year I did a rough scan of past usage of the Wikimedia Commons features, which showed that the features are most actively used by cultural heritage institution staff, and by people affiliated with Wikimedia organizations (who usually directly support cultural heritage institutions). Only a small percentage of users can be identified purely as “general Wikimedia editors” without a cultural heritage interest or affiliation.

In October, a traditional survey, held with LimeSurvey, asked respondents directly about their profiles, their usage of the Wikimedia Commons features, and how they (would like to) learn about them.

The questions asked in the survey are available in this document. A full result summary (direct export from LimeSurvey) is available as a PDF file. The data collected in the survey is available in this spreadsheet - the survey was fully anonymous so this data does not contain any personal information.

Below I’ll include a selection of answers, mostly showing the associated graphs. Full results (including free text answers) are available in the PDF export.

Profile of users

Multiple answers were possible here. Most respondents (15 out of 21) identify as associated with a cultural organization or other partner. 14 identify as Wikimedians. As this survey was advertised on talk pages, there is probably a bit of bias towards active Wikimedians compared to the general overview of a few months ago.

Most users hail from Europe (mainly N/W/Central/East). There were unfortunately no respondents from the MENA region (Middle East and North Africa) or Sub-Saharan Africa.

Most respondents (13) edit Wikimedia projects in English, followed by French (6), German (3) and other options (Malayalam, Dutch - a free-text duplicate of a listed option - and Basque). Multiple answers were possible.

Usage of OpenRefine

71.43% of respondents (15 out of 21) had used OpenRefine before.

Most use a local version of OpenRefine. Only a few (4) respondents use the cloud version on Wikimedia PAWS.

80.95% of respondents (17 people) indicate that they have downloaded and use the Wikimedia Commons extension. Still, a few people don’t know about the extension or are not sure. (To this day, the most recent version of this extension has been downloaded 257 times according to GitHub’s statistics.)

Two-thirds (10) of the 15 people who answered this question use OpenRefine outside Wikimedia; one-third (5) use it only for Wikimedia tasks.

Wikimedia Commons users of OpenRefine also use the software for other tasks, the most common being:

  • Linking/reconciling with Wikidata (15 responses)
  • Importing or editing data in Wikidata (14)
  • General data cleaning and manipulation (13)

Familiarity with Wikimedia Commons, Wikidata, and OpenRefine in general

For brevity, I am mostly summarizing this section. Full details and graphs (where not mentioned) can be found in the PDF.

Previous familiarity with Wikimedia Commons

  • Most respondents are moderately (23.81%) to very (57.14%) familiar with general uploading to Wikimedia Commons without OpenRefine. Only 4 respondents had never done this before they started this process with OpenRefine.
  • There is a similar pattern for respondents moderately (23.81%) or very (52.38%) familiar with editing existing files on Wikimedia Commons before engaging with OpenRefine. Slightly more people (19.05%) are totally unfamiliar, though.
  • Respondents feel slightly less confident and experienced about adding structured data to Commons files in general. 47.62% of respondents have done this but can learn a lot more; only 19.05% indicate they are very familiar with the process. 7 respondents have never done it, or have done it only a few times.

Previous familiarity with Wikidata

  • The vast majority of respondents (76.19%) had done manual Wikidata edits before starting with the Commons features in OpenRefine, but 3 people were fully unfamiliar with this.
  • Somewhat fewer people (61.90%) were very familiar with batch tools for this task (e.g. QuickStatements and OpenRefine), and 5 people were fully unfamiliar.

Previous familiarity with Wikipedia and other Wikimedia projects

  • 52.38% are very familiar with this, 28.57% moderately familiar. Only 3 respondents are unfamiliar.

Previous familiarity with OpenRefine itself, and other software

  • Fewer respondents were familiar with using OpenRefine for general data cleaning (see the graph in the PDF).
  • With regards to spreadsheet software: most respondents are intermediate (11 respondents) or advanced (9) users.
  • The next question asked about respondents’ usage of Pattypan, a previously popular, simple, volunteer-maintained tool for guided batch uploads to Wikimedia Commons. For various reasons, Pattypan has not been consistently operational recently - this probably explains why several respondents (6 out of 21) state that they ‘don’t want to use this tool’.
  • Answers are different for QuickStatements, a simple (also volunteer-built) batch editing framework for Wikidata, which most respondents seem to actively use and master.
  • Opinions are split around scripting and programming languages like Python and R. Some respondents are not proficient at all or don’t want to use these, others are somewhat to very proficient.
  • And in line with the profile of respondents: a majority are somewhat (9 people) or very (6) familiar with some kind of (cultural heritage) collections management software.

Learning and training preferences

How did respondents familiarize themselves with (the Wikimedia Commons features of) OpenRefine? How do they prefer to be onboarded?

Respondents learned to use OpenRefine in many different ways, the most popular being OpenRefine’s own documentation (18 out of 21 respondents), looking up specific tasks online (12), and following the instructions on Wikimedia Commons (11). Multiple answers could be given for this question.

Typical tasks and how achievable they are

The survey then asked respondents about (Wikimedia Commons related and general OpenRefine) tasks during their Wikimedia Commons workflow, and how achievable they found these. Most tasks take “average effort”, but Wikidata-related tasks are perceived as more achievable than others. Editing wikitext on Wikimedia Commons is clearly seen as a more challenging task.

The survey then had an open question asking respondents to describe whether they were able to finish the task they wanted to do, and whether they encountered barriers. Several comments mentioned file reconciliation on Commons not always working as intended/expected because certain characters are not recognized. Another recurring answer was ‘not so intuitive UI / difficult / unclear UI’. All answers can be read in the PDF.

Confidence and plans to use OpenRefine again

  • Confidence about one’s ability to use OpenRefine is generally moderate to high (see the graph in the PDF).
  • Most respondents (20 out of 21) plan to use OpenRefine again, one respondent is unsure.
  • When asked why (open question), several respondents state reasons that include ‘powerful’ and ‘easy’. See the PDF for the full list of answers.
  • When asked how many files respondents plan to upload in the next year, answers vary widely (see the graph in the PDF).
  • A next open question asked what kind of tasks with OpenRefine are recommended by the respondents. Answers vary widely and include many common OpenRefine tasks: data cleaning, reconciling, batch uploading to Wikidata and Wikimedia Commons. See the PDF for the full list of answers.
  • One respondent, who indicated they may not use OpenRefine anymore, mentions that they perceive the tool as “super clunky” and that “99% of my data cleaning / manipulation needs can be handled through simpler tools”.
  • When asked whether they would be interested in setting up a repeat or automated workflow with OpenRefine, 10 out of 20 respondents say ‘maybe’.

Feature and learning needs

  • Respondents were asked in an open question what they would need most to make better use of the Commons features in OpenRefine. Recurring answers include better support for wikitext, very short instruction videos for each separate task, a clearer UI, and various requests around help and documentation. See the PDF for the full list of answers.
  • There are not many answers to the question “What do you wish you had learned before starting doing your edits/uploads?”, and each answer is different. One respondent asks about data modeling conventions (what goes on Wikidata, what goes in wikitext, etc.); one respondent wanted more insight into how to work with Commons categories. See the PDF for the full list of answers.
  • When asked how respondents prefer to learn to use OpenRefine, answers are very diverse too, with some stronger preference towards reading documentation at the learners’ own pace.
  • 14 out of 21 respondents say “yes” when asked if they are interested in following the planned OpenRefine-Commons WikiLearn course.

Survey 2: Prioritization of most-wanted improvements

For this survey, I have used the open source initiative Allourideas, which I knew from previous projects (and which, asking around, has been used by other open source software projects for community feature prioritization as well). It’s a platform that helps communities prioritize ideas and needs in an easy way, by repeatedly presenting each respondent with two (random) options from a list (predefined and/or user-generated).

Screenshot of a typical choice prompt that survey participants were presented with.

The OpenRefine-Commons survey in October 2023 was pre-fed (“seeded”) with features drawn from the following sources (note that they are not only Wikimedia Commons-specific, but mostly consist of general high-level wishes that may be relevant to all OpenRefine users):

  1. Already identified Wikimedia Commons-specific needs (see list here)
  2. High-level needs identified in OpenRefine’s own draft roadmap, which has been compiled by community members over the past years (see here)
  3. The list of general requests collected in OpenRefine’s biennial survey, 2022 edition (see here).

The aggregated list of “seed suggestions” is available in this spreadsheet.

In addition, users were able to add suggestions of their own. In total, two users added three new suggestions, but they were either duplicates of existing ones (“upload files larger than 100MB”), very generic (“better reconciliation”), or a question (“Isn’t it already possible to upload directly from harddrive?”).

The Allourideas platform provides a few graphs to show engagement with a survey. We see that 21 users have participated, and together they have “clicked” (i.e. indicated a preference) 512 times - roughly 24 comparisons per participant.

Results of the voting process

The results of the voting are as follows. Each score is an estimate (0-100) of the chance that the idea wins when pitted against a randomly chosen other idea. Scores >50 can be considered “upvoted” (with higher preference for higher scores), scores <50 “downvoted” (they were passed over more frequently when presented as an option to the user).

| Idea | Score (0-100) |
| --- | --- |
| Reconciliation with arbitrary external datasets, e.g. csv reconciliation, reconciliation with another OpenRefine project, or via API | 78 |
| Multi-user support: allowing two or more people to work on the same project | 75 |
| (Wikimedia Commons) Simplify the creation of Wikitext when uploading media to Commons | 73 |
| Easier import from external datasets (e.g. via API) | 71 |
| (Wikimedia Commons) Make it possible to upload files larger than 100MB | 70 |
| More and better notifications, error messages and warnings in OpenRefine | 69 |
| Upload files larger than 100 MB ("chunked uploads") | 67 |
| (Wikimedia Commons) Preload standard (mandatory or recommended) metadata fields (schema template) when starting a Commons project in OpenRefine | 67 |
| (Wikimedia Commons) Better error reporting when uploads fail | 66 |
| Feature to add new rows in OpenRefine | 65 |
| Faster upload to Wikibase, Wikidata, or other Wikimedia projects | 62 |
| (Wikimedia Commons) Better preview of what uploaded files will look like | 59 |
| (Wikimedia Commons) Start the uploading process by simply selecting a folder on my harddrive | 59 |
| Support for more diverse (human language) alphabets/scripts, date and time formats... | 56 |
| Drag and drop for columns | 56 |
| (Wikimedia Commons) Load media files directly from my harddrive | 55 |
| A walkthrough tutorial inside the software itself, to introduce and guide new users | 50 |
| Make OpenRefine easier to learn and get started with: better or easier UX / interface | 50 |
| Less abandoned/inactive reconciliation services: clean up inactive ones | 50 |
| More reconciliation services | 48 |
| Loading and working with very large projects more easily/smoothly (100,000s of rows/records) | 47 |
| Some simple data visualization features | 47 |
| An online, hosted instance of OpenRefine | 44 |
| (Wikimedia Commons) Load data from EXIF of the files | 44 |
| Better reconciliation | 44 |
| (Wikimedia Commons) A dedicated importer from Flickr | 43 |
| (Wikimedia Commons) Upload an IIIF collection to Commons | 42 |
| Reconciling against a SPARQL query | 42 |
| Faster and more powerful reconciliation | 41 |
| More 'point and click' functions to replace GREL | 39 |
| Pause and resume my operations in OpenRefine | 37 |
| (Wikimedia Commons bug fix) Display thumbnails for all Wikimedia Commons files | 37 |
| Less abandoned OpenRefine extensions: only present maintained and currently operational ones | 32 |
| (Wikimedia Commons) Provide thumbnails for files that are hosted on the local computer, i.e. personal harddrive | 30 |
| A keyboard-accessible GUI | 28 |
| Better Python support | 25 |
| Allow working with R | 12 |

Allourideas presents these results on a web page only visible to the survey admin (from which I copy-pasted the above overview). It provides downloadable datasets too - however, they look different and contain details about each individual click, which can be challenging to parse. Those interested in the nitty-gritty of all user behavior in the survey can find the raw survey data in various spreadsheets in this folder. This specific spreadsheet contains the data closest to the overview I pasted above.
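For those who want to rebuild a simple ranking from that raw click data themselves, here is a minimal sketch of the kind of aggregation involved (hypothetical: it assumes one row per click with columns named "winner" and "loser", and a file name of my choosing - the actual Allourideas export headers differ and need adjusting):

```python
import csv
from collections import defaultdict

# Tally how often each idea won, and how often it was shown at all.
wins = defaultdict(int)
appearances = defaultdict(int)

with open("allourideas_votes.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wins[row["winner"]] += 1          # the idea the user clicked
        appearances[row["winner"]] += 1
        appearances[row["loser"]] += 1    # the idea that was passed over

# Naive score: the percentage of its matchups that an idea won.
# (Allourideas' published scores come from a more sophisticated
# statistical model, so the numbers will not match exactly.)
scores = {idea: 100 * wins[idea] / appearances[idea] for idea in appearances}

for idea, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:5.1f}  {idea}")
```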

A few Wikimedia Commons related observations

Wikimedia Commons has recently reached the milestone of 100 million files on the platform. Of these 100 million, around 100,000 (or 0.1%) have now been uploaded with OpenRefine. External observers sometimes mention to me that working with Wikimedia Commons itself is challenging, which is true for complete outsiders - but nevertheless, so many people have managed to understand this process that the platform reached 100 million files! The Wikimedia movement has organized and created various upload and support processes to tackle this: guided uploads via tools like the UploadWizard or the upcoming Flickypedia; a lively and generally helpful editor community; and, crucially for cultural and other partners of the Wikimedia movement: support in the form of e.g. national and regional affiliates with volunteers and sometimes staff who actively help and onboard partners. It seems that most existing Wikimedia Commons users of OpenRefine were already familiar with the Commons upload and editing process and use the tool as a new approach for a (sometimes already well known) task. Probably OpenRefine is not the best tool for the job for complete newcomers.

As I am working on a train-the-trainer program and am busy preparing an online course on WikiLearn, I was very interested in seeing the learning preferences of the respondents. It seems most are happy to follow documentation at their own pace. At the same time, I’m aware that the people who completed this survey are those who have already, on their own initiative, discovered the features and trained/educated themselves. I also have a lot of interaction with users for whom this is challenging (both practically and because of a lower skill level), and I hear in these conversations that in-person training and online guided trainings are extremely important to bring people on board. To grow the number of Wikimedia Commons contributions through OpenRefine, the Wikimedia movement needs a large number of OpenRefine trainers and mentors.

With regards to feature prioritization, I am familiar with the Wikimedia Commons-related requests, and am not very surprised about the Commons-specific needs that drift to the top. With regards to the very actively requested better support for wikitext: as discussed in another thread in this forum, the most broadly impactful approach (in my opinion), not just for OpenRefine but for any upload process to Wikimedia Commons, would be to address this issue on the Wikimedia Commons side: make sure that uploaders can fully focus on structured data, and only need to use minimal wikitext. I am bringing together a working group to address this.


A few of my many thoughts:

  1. Great to see broad global usage, but disappointed to see no MENA users this time.

The above seems to resonate with your own past observations about the difficulty of face-to-face training, perhaps because of the deeper knowledge often needed for GLAM folks and the full-day-plus trainings? In light of that "deeper knowledge" needed, do you feel we (OpenRefine) or Wikimedia should fund video trainings above and beyond your grant allocation? Or do you feel video trainings are plentiful enough out there and we (OpenRefine) need to do better at gathering up and presenting quality tutorials (maybe a bunch of mini feature tutorials) on our website? My hunch, and from what I've heard from you before, is that we need to do 2 things: improve docs/tutorials and improve in-app guides. Do you still feel this way or do you now feel differently?

That feature combines 2 things: 1. data joining with key(s), which has been asked for by journalists and scientists before; and 2. basically self-hosted reconciliation, correct?

Bravo and thank you very much for the effort. It is a very useful effort, @Sandra.

Sorry I haven't had time to answer it myself; I regret that.

Best Regards,
Antoine

Thanks a lot for publishing this @Sandra! The results are very interesting.

The results of the LimeSurvey seem (surprisingly to me) overall rather positive; it's nice to see that.
I am pleasantly surprised that the Commons extension has such a high uptake - I had the impression that more people were using the upload functionalities of the preinstalled Wikibase extension, without relying on the Commons one, but it looks like quite a few people manage to install the extension despite the rather complicated process.

There are tons of comments that come to mind about the ranking from the Allourideas survey; I am not sure this is the best place to discuss them all. Only one note from me: my biggest surprise is perhaps that there is such a high interest in multi-user support. Although it is indeed a feature that is requested regularly, there were quite convincing arguments recently that data cleaning in OpenRefine isn't something that can be parallelized between multiple people that well (@abbe98 and @thadguidry in this thread about hosted uses of OpenRefine). So it would be interesting to investigate further which sorts of tasks people were thinking about doing as a team.

On the shorter term, the outcomes of the surveys will be very useful to inform which bugs or features to work on as part of the upcoming sprint on the Commons extension.

Agree, and no users from Sub-Saharan Africa either - although both regions have extremely vibrant and ambitious Wikimedia communities.

I notice that I'm already spending (much) more time in general than the hours allocated/budgeted in the current Wikimedia grant. It's very labor-intensive work, also because I need to onboard myself onto a new platform (WikiLearn, based on Open edX). That said, budgets are limited, and so are the number of people available and their capacity to do this. What's the best bang for the (few) bucks, so that more people can actually work with data, and so that we need less time and money for creating and watching videos?

There is indeed already "so much stuff" out there. It's indeed a challenge, and time consuming, to watch, read and pick the best resource for the job. One example: there are many online explainers for the cell.cross function, but the single one that I find enlightening is this one, due to its color coding and very visible explanations in the screenshots. However, it rarely drifts to the top in my search engine results and I'm always looking for it again! A big issue is keeping things up to date as OpenRefine changes. Some documentation that I created last year already needs/needed updates. I'm hesitant about video creation because a few changes in the application may already render a video useless (after it was time consuming to produce).
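For context, this is the kind of one-liner that cell.cross enables and that is hard to discover without a good explainer. The GREL expression below looks up the current cell in the "Name" column of another open project called "Institutions" and fetches "Country" from the first matching row (the project and column names are made up for illustration):

```
cell.cross("Institutions", "Name")[0].cells["Country"].value
```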

Some type of in-app guide would be absolutely amazing, a real time saver. It would basically take away a lot of the need for the repetitive "how does OpenRefine work, what does everything mean, what are all the buttons and menus" dance that each trainer now does. It would also make users feel much more confident that they're doing the right thing when trying the application themselves.

I found it very interesting that this drifted to the top. But yes, every group of trainees asks for this in some form. Basically, folks have some other dataset, either big or small, that they want to "fuzzily" compare with the main one they work with. I'd say it's what csv-reconcile and reconcile-csv try to do. That one (csv-reconcile), however, has a complex installation process (I'm not dumb, but I don't understand how to do it, and so I also don't actively recommend it to people I train and work with). Compared to that, installing the Commons extension is a breeze :wink:

I'm surprised too. I don't have strong opinions; I think this really needs very in-depth user research.
In the various scenarios in which I use OpenRefine, it would only be useful if it came with functionality very similar to Google Docs: the ability to communicate with collaborators, tag them, explain what's been done at a certain point, tell them where to continue, what to repeat, etc.

I think it's quite interesting that quite a few of the Allourideas suggestions that weren't popular happen to be personal focuses of some of us technical contributors (personally I'm slightly surprised to see keyboard accessibility at a score of 28). At the same time, some of the highly voted features seem to be quite low-barrier development-wise.

We need to be better at getting and retaining contributors; maybe this shows a little bit extra on Wikimedia-related features and needs, given that this work has been pushed by external funding rather than by organic contributions?


I wonder if people aren't basically after easier sharing of projects/examples. Kinda like how WCQS makes it super easy for people to "collaborate" on Wikidata queries.

People can of course share the project anyway by file export, but that creates a higher barrier for both the person requesting help and potential helpers, compared to a URL.

@abbe98 We'd be guessing, no? So @Sandra, why not send out an updated survey that simply asks those users "Please type what you mean by multi-user support. Describe the experience and detail how you wish it would work."

@thadguidry Instead of asking this small group, which represents only 21 people from a very specific user group, I suggest we use the biennial survey to ask questions related to the roadmap. I am planning to conduct the survey early in 2024, and we usually receive between 150 and 200 responses.

We can use the Allourideas platform, which is very useful in ranking our features. Before running the survey, we should clarify certain items on the list and ask for more details about specific features, such as the multi-user option.

I make no other claim. Guessing is just as important as anything else when trying to figure out what a user wants or how to phrase future options.

Do you have a link for that? A quick search shows that this is a very generic phrase in common language… I'd like to take a look at it.

Regards, Antoine

This is the platform Martin referred to:

https://allourideas.org/

I did the survey in 2022. It got so many responses because I put quite a bit of effort into actively posting about it in forums and mailing lists of various communities out there (and I didn't actively promote it in the Wikimedia community, as that felt like a conflict of interest to me).

My main feedback about the user survey: it shows an aging user population. There are many new users whom the survey does not reach. In my day to day experience training new users today, they operate in a very different context - some have access to many specialized mini-tools, many use ChatGPT and fam for data cleaning tasks, etc. What would be a way to reach more new users?

I've always hoped that it would be possible to reach many more users by actively advertising the survey inside the tool itself. Just like users are notified very clearly of new software versions. And also announce the survey on the download page?

I've always hoped that it would be possible to reach many more users by actively advertising the survey inside the tool itself. Just like users are notified very clearly of new software versions.

I completely agree. We are in the process of making some changes to the way those new software version notifications are fetched, and that should make it easier to also support other types of notifications, such as notifying people of a survey. I have opened an issue about it and I would like to get this fixed for 3.8.

Obviously, this means that the notification will initially only be visible to people who have upgraded to the latest version: we cannot retrospectively patch the previous versions for them to also display it. (Or we could generate a fake GitHub release with a suitably chosen release name, to weaponize the new version notification to advertise the survey instead, but that would be really bad…)
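To make the idea a bit more concrete, the fetch could look roughly like this sketch (in Python pseudocode rather than OpenRefine's actual Java, with a made-up feed URL and format - none of this reflects the real implementation):

```python
import json
from urllib.request import urlopen

# Hypothetical announcements feed, fetched alongside the version check.
FEED_URL = "https://example.org/openrefine/announcements.json"

def unseen_announcements(seen_ids):
    """Return feed entries the user has not dismissed yet."""
    with urlopen(FEED_URL) as resp:
        feed = json.load(resp)
    # Each entry might look like:
    # {"id": "survey-2024", "message": "Take our survey!", "url": "https://..."}
    return [a for a in feed.get("announcements", []) if a["id"] not in seen_ids]
```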

And also announce the survey on the download page?

That's even easier to do, and would totally make sense too.


@Sandra, thanks for sharing your experience. I agree with your and @antonin_d's suggestions.

In terms of timeline, I may only be able to look into it in March after I complete the CSCCE Community Playbooks Workshop (see details at Reflecting on 2023 and Looking Ahead to 2024 as OpenRefine Project Manager).

Before running the survey, we should clarify certain items on the list and ask for more details about specific features, such as the multi-user option.

Martin - I think it would be useful to involve the community in crafting the questionnaire to maximize the usefulness of the results.

I've always hoped that it would be possible to reach many more users by actively advertising the survey inside the tool itself. Just like users are notified very clearly of new software versions.

I completely agree. We are in the process of making some changes to the way those new software version notifications are fetched, and that should make it easier to also support other types of notifications, such as notifying people of a survey. I have opened an issue about it and I would like to get this fixed for 3.8.

Any privacy-leaking functions probably should be opt-in only. That includes fixing our existing update checker.

As long as we're adding opt-in features, perhaps it's time to reconsider this idea I first raised in 2014:

It's tempting to introduce opt-in usage reporting to try and get a better handle on usage, workloads, dataset characteristics, etc, but I don't know how that would go over.

9 1/2 years later, I think people are more used to anonymous usage metrics, so I'm much less reluctant to propose the idea.

Note that even without any additional functionality, we'd at least get invocation counts from the update check, which would be an improvement over the current situation where we only have download counts (which is why it needs to be opt-in).

Tom


Yes, so far I think we don't even have a way to opt-out of the release notifications, so that's not ideal.

Since the discussion is diverging a bit from the original survey results, I'd continue it in this issue instead: