Results from the Feature Prioritization Survey 2024

Martin · October 25, 2024, 4:46pm

Presentation

This post focuses on the results of the feature voting conducted alongside the survey on the All Our Ideas platform. Voting ran from August 1, 2024, to October 7, 2024. Of the 226 survey participants, 109 took part in the feature ranking (only participants in our bi-yearly survey had access to the link). A total of 9,077 votes were cast, amounting to over 16 hours of cumulative voting time. On average, each participant spent 9 minutes voting and answered 83 questions. However, this average was skewed by a few highly active users: most participants voted on fewer than 50 questions, while some answered over 300, and one participant reached 1,280.
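As a quick sanity check, the averages above can be reproduced from the totals quoted in this post. This is only a sketch: it treats "over 16 hours" as exactly 16 hours, so the time per participant is a lower bound.

```python
# Totals quoted in the post.
survey_participants = 226
voters = 109
total_votes = 9077
total_hours = 16  # "over 16 hours", so this is a lower bound

participation_rate = voters / survey_participants
votes_per_voter = total_votes / voters
minutes_per_voter = total_hours * 60 / voters

print(f"Participation: {participation_rate:.0%}")
print(f"Votes per voter: {votes_per_voter:.1f}")
print(f"Minutes per voter: {minutes_per_voter:.1f}")
```

Because a handful of voters answered hundreds of questions, the mean of about 83 sits well above what a typical participant did.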

To enhance readability, I grouped similar suggestions together and adjusted their votes. I manually approved each user's suggestion to prevent spam and avoid duplicate entries. For merged entries, the final score is a weighted average of the similar features' scores, or a boosted score for duplicates that weren't presented to participants.
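To illustrate the merging step, here is a minimal sketch of a vote-weighted average. The three scores (60, 54, 64) come from the merged GREL macro row in the ranking table; the vote counts used as weights are hypothetical, and the exact weighting used for the published scores is not described in this post.

```python
def merged_score(scores, votes):
    """Average several scores, weighting each by its number of votes."""
    total = sum(votes)
    if total == 0:
        raise ValueError("at least one vote is required")
    return sum(s * v for s, v in zip(scores, votes)) / total


scores = [60, 54, 64]  # individual scores quoted in the table

# Equal weights reduce to a plain average.
equal = merged_score(scores, [1, 1, 1])

# Hypothetical vote counts: a late suggestion with few votes counts for less.
weighted = merged_score(scores, [400, 350, 150])

print(f"Plain average: {equal:.1f}")
print(f"Vote-weighted average: {weighted:.1f}")
```

Weighting by votes keeps a suggestion that was shown only a few times from pulling the merged score as strongly as one that was compared hundreds of times.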

It’s important to note that features with fewer total votes (especially those introduced late in the survey) had less exposure to participants than other suggestions, and this may have influenced their final score (positively or negatively). I left a note on the six suggestions with fewer than 208 votes to ensure fair interpretation (the 208-vote threshold was determined by taking the bottom 10% of the votes).

Interpreting the Scores:

The score assigned to each feature reflects its relative preference based on head-to-head comparisons made by participants during the survey. Each time a feature was selected over another, it earned a "win." The score is calculated based on the ratio of wins to the total number of votes. Here’s how to interpret the scores:

High Scores (Above 65): Calculated as the mean + 1 standard deviation, these indicate strong community preference. Features with high scores won more frequently, reflecting importance or desirability.
Middle Range Scores (35-65): These scores indicate mixed responses; while often ranked favorably, these features also lost some comparisons, suggesting they are useful but not critical to all users.
Low Scores (Below 35): Calculated as mean - 1 standard deviation, low scores indicate less frequent wins in comparisons. These features may represent more niche requests or less urgent needs.
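The interpretation above can be sketched in a few lines. This assumes the score is the percentage of wins among a feature's own votes, and it uses the 35 and 65 cut-offs quoted in this post; the `bands` helper shows how such cut-offs could be derived from a list of scores, but whether a population or sample standard deviation was used is not stated, so `pstdev` is an assumption. The win count in the example is hypothetical.

```python
from statistics import mean, pstdev


def score(wins, total_votes):
    """Percentage of head-to-head comparisons that the feature won."""
    return 100 * wins / total_votes


def bands(scores):
    """Return (low, high) cut-offs as mean -/+ one standard deviation."""
    m, sd = mean(scores), pstdev(scores)
    return m - sd, m + sd


def classify(s, low=35, high=65):
    """Label a score using the thresholds quoted in the post."""
    if s > high:
        return "high"
    if s < low:
        return "low"
    return "middle"


# Hypothetical example: 156 wins out of 208 votes.
example = score(156, 208)
print(example, classify(example))
```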

Feature Ranking

The columns labeled Source indicate the origin of each feature suggestion:

User: A feature suggested by a participant, subsequently included in the vote.
Seed: A feature pre-selected as part of the initial list of options in the survey.

If you believe that any of the issue relationships listed here need to be updated (added or removed), please let me know. We want to ensure that all feature requests are accurately linked to the relevant discussions and issues on GitHub.

Score	Source	Feature Request	Comment
76	User	Remove duplicate rows: rows that already exist are deleted; optional feature: only select certain columns to be evaluated for comparison	Related issue: #3218. This suggestion was introduced later in the survey and, therefore, received fewer votes compared to others.
76	Seed	Native Reconciliation with arbitrary external datasets, e.g. csv reconciliation, reconciliation with another OpenRefine project. Reconciliation against any dataset, not only the ones with a reconciliation endpoint. Reconciliation locally is a very useful feature. I suggest local reconciliation based on projects in OpenRefine instead of using external services. (from the survey's open-ended questions)	The current solution is to use one of the csv-based reconciliation services. Related issue: #2003
73	User	Allowing import again for adding new rows to existing projects. Easy adding of additional rows/importing rows from other datasets. Update existing project with new data (append rows from a file). Extend datasets: add rows to existing datasets from files, assuming the same column structure/order. The inability to update imported data is a limiting factor (user survey)	Related issue: #715
72	User	Making joining records easier across 2 datasets by multiple keys!	Need to create an issue
72	User	Easier rename a column by clicking on it and don't break any facets that depend on it	Loosely related issue: #6282
68	Seed	Loading and working very large projects more easily/smoothly (100,000s of rows/records). Loading and working large datasets could increase OR's user base in both numbers and variety to a great extent in my opinion.	Related to the `scaling - 4.x branch`
67	User	Allow canceling at any time those long spinning operations like Clustering	Clustering will have many improvements in the upcoming 3.9 version. Need to create an issue for canceling operations that take a long time to compute.
67	Seed	Better support of nested structure (improved record mode)	See discussion in Representing hierarchical data: beyond the records mode?
67	User	More GREL options that allow for creating and using variables for your dataset	Need to create an issue
65	Seed	Option to save facet	Related issue: #560
62	Seed	Support Python 3 as an expression language - Continue maintaining Python / Jython option.	Also requested in the survey's open-ended questions. Related issue: #2249
61	Seed	Drag and drop for columns	Issue to be created
60	Seed	Native Reconciliation against a SPARQL query	This is supported via RDF extensions but may not be known to all users. How can we better advertise it?
59	Seed	Support for more diverse (human language) alphabets/scripts, date and time formats...	No related issue. This question comes from the Diversity grant.
58	User	Allow bookmarking and naming starred GREL expressions so they can show in a Star top-level menu. (seed) / Allow Users to customize a Custom Menu to save macro (user suggestion) / More 'point and click' functions to replace GREL (seed)	The consensus from discussions is to promote an option for users to create macros. I merged the two seeded questions (individual scores of 60 and 54) and the user suggestion (score of 64) for an average score of 59. This is loosely related to the suggestion of allowing the sharing and exploring of public expressions. Related issue: #109
58	User	Quick delete all rows having empty cells	Related issue: #1472
58	Seed	In context help for GREL or wizard-like approach to writing GREL (like Excel)	Mentioned in User Interviews Results Part 2: Exploring Feedback Regarding OpenRefine Feature and User Experience. This issue would require additional design work to better scope and refine the feature.
58	Seed	More and better notifications, error messages, and warnings in OpenRefine	This is tracked via many issues (and most likely many more to be created) under the error handling tag on GitHub
58	Seed	Multi-user support: allowing two or more people to work on the same project	Related issue: #101 and discussion at the 2024 Barcamp OpenRefine 2024 Barcamp: OpenRefine as a Service
53	Seed	Some simple data visualization features	Related issue: #5315
53	User	Don't go back to the beginning after matching during reconciliation	Related issue: #33 and #6546. The improvement is part of the upcoming release of version 3.9. This suggestion was introduced later in the survey and therefore received fewer votes compared to others.
53	Seed	Option to refactor the JSON operation scripts to edit a facet, update a GREL command, or add a step	Related discussion Which reproducibility should we focus on? - #5 by Martin
51	Seed	Improved integration with cloud storage services for data import and export.	Need to create an issue and better scope this feature.
50	User	Supported REST API for external use	This is already supported via the "Add column by fetching URLs" operation. During the 2024 BarCamp we discussed supporting OpenAPI within OpenRefine: OpenRefine 2024 Barcamp: Support OpenAPI in OpenRefine
49	Seed	Improved JSON parsing when calling API	Related issues: #1440 #2515
49	User	Faster rendering of many columns in Record mode
48	User	Integrate a call to HuggingFace AI models to automate tasks (see HuggingSheet)	@Michael_Markert shows how to integrate OpenRefine with an LLM here: Using local ChatGPT-like LLMs in OpenRefine for data wrangling. This suggestion was introduced later in the survey and, therefore, received fewer votes compared to others.
47	Seed	Make OpenRefine easier to learn and get started with better or easier UX / interface	This would need dedicated design effort
46	Seed	Pause and resume my operations in OpenRefine	Related discussion: Partial results of long-running operations
46	Seed	Save Template exports	Related issues: #1928 #468
45	Seed	Allow users to set precise values for numeric facets	Related issues: #5168 #5008
44	Seed	Less abandoned OpenRefine extensions: only present maintained and currently operational ones	A participant indicated in the open-ended question of the survey that many plugins and services are VERY DATED and look abandoned. This needs a refresh. See also conversation in Improving the UX of extension install, and Butterfly
44	Seed	Better support of MARC format for complex dimensions and repeating elements.	Related issues: #794 #2127
44	Seed	A walkthrough tutorial inside the software itself, to introduce and guide new users	This would need dedicated design effort
43	User	Ability to extend data by bringing in qualifiers from Wikidata	This suggestion was introduced later in the survey and therefore received fewer votes compared to others.
41	User	Supported client/client library based on REST API
40	User	AI integrated help for writing regular expressions, GREL etc	May be related to the suggestion In context help for GREL or wizard-like approach to writing GREL (like Excel)
38	User	Default Wikimedia support as a core OpenRefine feature
37	User	Add transform for book-style Title casing	Need to create an issue
37	User	Allow sharing and exploring of public expressions	This is discussed in Which reproducibility should we focus on? . This is loosely related to #109
35	Seed	Faster upload to Wikibase, Wikidata, or other Wikimedia projects – Fully maintained production Wikidata reconciliation service with better reconciliation and performance	See the summary of the discussion during the 2024 Barcamp OpenRefine 2024 Barcamp: Reconciliation in OpenRefine
34	Seed	An online, hosted instance of OpenRefine	This is often requested by trainers as a replacement for the unstable mybinder deployment. See also the summary of the discussion during the 2024 Barcamp OpenRefine 2024 Barcamp: OpenRefine as a Service
34	Seed	A keyboard-accessible GUI	This is supported via the keyboard acceleration extension prototype. See discussion Keyboard acceleration extension prototype and repo
32	User	Work on reconciliation of Wikidata Lexemes	Related issue: #2240 and forum discussion OpenRefine support for Lexemes in Wikidata: how would you use this?. This suggestion was introduced later in the survey and therefore received fewer votes compared to others.
32	Seed	Delete multiple projects at once	Related issue: #4965
29	User	Parquet import/export	Related issue: #1929
26	User	Easy start/stop of OpenRefine on Windows	Related issue: #3221
24	User	Better support for SELF HOSTED wikidata instances (setting up manifests, and creating data previews (when reconciling) is full of dark secrets). Wikibase cloud reconciliation - Improved integration with Wikibase would be important to me because right now I have to make do with some workarounds that can be time-consuming	I suppose this suggestion relates to Wikibase instances rather than OpenRefine itself. See this thread regarding the effort to make this process easier: Fundraising to commission the development of a MediaWiki extension for reconciliation with Wikibase. This suggestion was introduced later in the survey and therefore received fewer votes compared to others.
20	User	Support HDF5 importer and selecting a file within it	Related issue: #640
20	Seed	Support R as an expression language	Related issue: #1226

Additional Feature Requests

These are suggestions gathered from the open-ended questions in the user survey. They were not part of the options available for voting but provide valuable insights into user needs and potential improvements.

Dark mode would be greatly appreciated: Related issue: #3017
An official docker hub image would be nice. This is available in GitHub - OpenRefine/containers: Collection of containerized packages of OpenRefine; see discussion Proposal for a new repository: containerizations for OpenRefine
Any features about geographic coordinates will be very useful. Related issue: #6570 and forum discussion OpenRefine 2024 Barcamp: Making OpenRefine more useful as exploratory tool
Connect with Zotero for reconciliation and publication. Issue to be created.
ODS spreadsheets fail to upload. Related issues: #6877, #3055, #2243
Adjust columns. Related issue: #4806
Increase the size of the preview window when we are working on the column. I work with really long values, and sometimes, I can't even see one full value in the preview.
I really wish there was a settings config in the GUI that would highlight what could be modified and what the current configuration files are, even if they can't be modified from the menu. This would expose what could be customized by the user in the config and provide a guide on how to extend or what settings could be modified. In particular, this could be used to highlight new defaults and new reconciliation services and extensions, both of which are only really visible if you dive deep into the help or are working in one of those areas. The ability to discover that they exist at all, if they changed, or their current status (i.e., broken, working, slowed) would benefit from a settings menu in the OpenRefine GUI for cross-discovery and for lowering the entry point to OpenRefine's more useful integrations.

tfmorris · October 25, 2024, 5:50pm

Thanks Martin! That will take a while to process, but a few quick questions off the top of my head:

will the raw vote tallies themselves be released? There's a pretty big range in the 1-208 vote span (those that received a special comment) which could significantly affect score interpretation. It would also be interesting to look at different visualizations for votes per user and votes per day.
since the "mean" is exactly 50, it's presumably normalized in some way. Can you describe the normalization?
is the score of wins/total done using the total votes for a given entry or the total votes overall? Hopefully the former so they're not biased by the number of times an entry was presented for voting.
the 1280 vote user (and perhaps the 640 & 641 vote users) seem like significant outliers. Did you do any further analysis to see if there's anything suspicious going on there? Or sensitivity analysis to see what results look like with that/those voter(s) excluded?
was the phrase "Loading and working large datasets could increase OR's user base in both numbers and variety to a great extent in my opinion." actually included in the question? Hopefully not!

Tom

Martin · October 25, 2024, 6:07pm

@tfmorris here is the raw data: choice_votes_fc4d421d-b2f3-4f8a-aaa6-9d56194df9b1.xlsx (1.1 MB). The first tab includes the cleanup I did (merging questions and adjusting scores).

Let me clarify how the scoring is done.

All Our Ideas presents a pair of statements to the participant.
The participant votes for one or the other (or suggests something else). This is what All Our Ideas calls one vote. A vote can be a win or a loss.
On average each statement received 350 votes.
The score (presented in my analysis) represents how many times the statement won the vote vs the number of votes.
The mean of 50 indicates that a statement won as many votes as it lost.
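The steps above can be sketched as a small tally. This assumes each vote is recorded as a (winner, loser) pair and that a statement's score uses its own appearances as the denominator; the vote list is hypothetical and is not taken from the spreadsheet.

```python
from collections import Counter


def tally(votes):
    """Compute per-statement scores from (winner, loser) pairs."""
    wins = Counter()
    appearances = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    scores = {s: 100 * wins[s] / appearances[s] for s in appearances}
    return scores, wins, appearances


# Hypothetical votes between three statements A, B and C.
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
scores, wins, appearances = tally(votes)

# Every vote adds one win and two appearances, so the pooled rate is 50%.
pooled = 100 * sum(wins.values()) / sum(appearances.values())

for statement, value in sorted(scores.items()):
    print(f"{statement}: {value:.0f}")
print(f"Pooled win rate: {pooled:.0f}")
```

Since every comparison produces exactly one win and one loss, wins and losses balance across all statements, which is why the scores centre on 50.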

I didn't investigate the person who recorded 1280 votes.

A user shared the statement, "Loading and working large datasets could increase OR's user base in both numbers and variety to a great extent, in my opinion." I merged it with the statement, "Loading and working very large projects more easily/smoothly (100,000s of rows/records)."

tfmorris · October 25, 2024, 6:50pm

Thanks for the quick reply! Since the top feature request got 208 votes (vs a mean of 350) and was a 3:1 winner, I think we can probably remove the asterisk next to it.

I'll see if I find anything odd about the outlying voters. Since the top 3 voters represent over 25% of the votes cast, it seems worth a little investigation.
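For reference, both figures can be checked against the numbers quoted earlier in the thread (9,077 votes in total; the three largest voters at 1,280, 641 and 640 votes; a top score of 76). This is plain arithmetic on those published numbers, not a reading of the raw spreadsheet.

```python
total_votes = 9077
top_three = [1280, 641, 640]  # vote counts mentioned in the thread

top_share = sum(top_three) / total_votes
print(f"Top 3 voters: {sum(top_three)} votes, {top_share:.1%} of all votes")

# A score of 76 means 76 wins for every 24 losses.
top_score = 76
odds = top_score / (100 - top_score)
print(f"Score {top_score}: about {odds:.1f} wins per loss")
```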

Tom

p.s. of course, the average score is 50! I wasn't thinking clearly when I asked that question.

thadguidry · October 26, 2024, 3:00am

I'm happy to report that I spent about 2 hours during an afternoon, clicking thru the AllOurIdeas survey. So maybe the person who recorded 1280 votes was me maybe not.

abbe98 · November 12, 2024, 12:59pm

Related proposal that I'm working on over at the reconciliation-spec repository: Proposal: Structured Previews by Abbe98 · Pull Request #182 · reconciliation-api/specs · GitHub
