Results from the Feature Prioritization Survey 2024

Presentation

This post focuses on the results of the feature voting conducted alongside the survey, using the All Our Ideas platform. The vote took place between August 1, 2024, and October 7, 2024. Of the 226 survey participants, 109 took part in the feature ranking (only participants in our bi-yearly survey had access to the link). A total of 9,077 votes were cast, for over 16 hours of cumulative voting time. On average, each participant spent 9 minutes voting and answered 83 questions. However, this average was skewed by a few highly active users: most participants voted on fewer than 50 questions, while some answered over 300, and one participant reached 1,280.
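As a quick sanity check on the averages quoted above, the arithmetic can be reproduced in a few lines. The totals come from this post; the `sample` list is invented purely to show how a few heavy voters pull the mean away from the median, and is not the survey data.

```python
import statistics

# Figures quoted in the post.
total_votes = 9_077
voters = 109
total_minutes = 16 * 60  # "over 16 hours", so this is a lower bound

votes_per_voter = total_votes / voters
minutes_per_voter = total_minutes / voters
print(f"{votes_per_voter:.1f} votes per participant")
print(f"{minutes_per_voter:.1f} minutes per participant")

# Hypothetical vote counts per participant (NOT the real distribution):
# most people cast a few dozen votes, two cast a lot.
sample = [12, 20, 25, 30, 35, 40, 48, 300, 1280]
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
print(f"mean={sample_mean:.0f}, median={sample_median}")
```

With a distribution shaped like this, the median is a better description of the typical participant than the mean.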

To enhance readability, I grouped similar suggestions together and adjusted their votes. I manually approved each user suggestion to prevent spam and avoid duplicate entries. The final score is either a weighted average across similar features or a boosted result for features whose duplicates weren't presented to participants.
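A minimal sketch of what a weighted average for merged suggestions could look like, assuming each suggestion's score is weighted by the number of votes it received. That weighting is an assumption, since the post does not say which weights were used. The three scores (60, 54, 64) are the ones reported for the merged macro suggestion in the ranking; the vote counts and the `merged_score` helper are hypothetical.

```python
def merged_score(entries):
    """Vote-weighted average of (score, votes) pairs."""
    total_votes = sum(votes for _, votes in entries)
    if total_votes == 0:
        raise ValueError("cannot merge entries that received no votes")
    return sum(score * votes for score, votes in entries) / total_votes


# Scores from the post; vote counts are invented for illustration.
entries = [(60, 400), (54, 380), (64, 150)]
weighted = merged_score(entries)

# With equal weights the result is the plain average of the three scores.
unweighted = merged_score([(score, 1) for score, _ in entries])

print(f"weighted={weighted:.1f}, unweighted={unweighted:.1f}")
```

A suggestion introduced late, with few votes, moves the merged score less than it would in a plain average.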

It’s important to note that features with fewer total votes (especially those introduced late in the survey) had less exposure to participants than other suggestions, which may have influenced their final score (positively or negatively). I left a note on the six suggestions with fewer than 208 votes to ensure fair interpretation (the 208-vote threshold was determined by taking the bottom 10% of vote totals).
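One way such a cut-off could be derived is as the first decile of the per-suggestion vote totals. The sketch below assumes that reading. The suggestion names and vote totals are made up, and `statistics.quantiles` with its default method is only one of several common percentile definitions, so the exact threshold may differ from the one used for the survey.

```python
import statistics

# Hypothetical vote totals per suggestion (NOT the survey data).
votes_per_suggestion = {
    "late_suggestion": 120,
    "s2": 300, "s3": 320, "s4": 340, "s5": 350,
    "s6": 360, "s7": 380, "s8": 400, "s9": 420, "s10": 450,
}

# First decile: the value below which roughly 10% of the totals fall.
threshold = statistics.quantiles(votes_per_suggestion.values(), n=10)[0]

# Suggestions that deserve a "received fewer votes" note.
flagged = sorted(
    name for name, votes in votes_per_suggestion.items() if votes < threshold
)
print(threshold, flagged)
```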

Interpreting the Scores:

The score assigned to each feature reflects its relative preference based on head-to-head comparisons made by participants during the survey. Each time a feature was selected over another, it earned a "win." The score is the ratio of wins to the total number of votes the feature received, expressed as a percentage. Here’s how to interpret the scores:

  • High Scores (Above 65): Calculated as the mean + 1 standard deviation, these indicate strong community preference. Features with high scores won more frequently, reflecting importance or desirability.
  • Middle Range Scores (35-65): These scores indicate mixed responses; while often ranked favorably, these features also lost some comparisons, suggesting they are useful but not critical to all users.
  • Low Scores (Below 35): Calculated as mean - 1 standard deviation, low scores indicate less frequent wins in comparisons. These features may represent more niche requests or less urgent needs.
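The bands above can be sketched in code as follows. This is an illustration under stated assumptions, not the analysis script used for the survey: the function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is a guess, since the post does not say which variant was used.

```python
import statistics


def score(wins, votes):
    """Share of head-to-head comparisons won, as a percentage."""
    return 100 * wins / votes


def bands(scores):
    """Return (low, high) cut-offs at mean - 1 SD and mean + 1 SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # assumption: population SD
    return mean - sd, mean + sd


def classify(value, low, high):
    """Label a score using the cut-offs described in the post."""
    if value > high:
        return "high"
    if value < low:
        return "low"
    return "middle"


# Using the rounded cut-offs quoted in the post (35 and 65).
for s in (76, 50, 20):
    print(s, classify(s, low=35, high=65))
```

A score of exactly 35 or 65 falls in the middle range, matching the "Above 65" and "Below 35" wording.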

Feature Ranking

The column labeled Source indicates the origin of each feature suggestion:

  • User: A feature suggested by a participant, subsequently included in the vote.
  • Seed: A feature pre-selected as part of the initial list of options in the survey.

If you believe that any of the issue relationships listed here need to be updated (added or removed), please let me know. We want to ensure that all feature requests are accurately linked to the relevant discussions and issues on GitHub.

Score | Source | Feature Request | Comment
76 | User | Remove duplicate rows: rows that already exist are deleted; optional feature: only select certain columns to be evaluated for comparison | Related issue: #3218. This suggestion was introduced later in the survey and therefore received fewer votes than others.
76 | Seed | Native Reconciliation with arbitrary external datasets, e.g. csv reconciliation, reconciliation with another OpenRefine project. Reconciliation against any dataset, not only the ones with a reconciliation endpoint. Reconciliation locally is a very useful feature. I suggest local reconciliation based on projects in Openrefine instead of using external services. (from the survey's open-ended questions) | The current solution is to use one of the csv-based reconciliation services. Related issue: #2003
73 | User | Allowing import again for adding new rows to existing projects. Easy adding of additional rows/importing rows from other datasets. Update existing project with new data (append rows from a file). Extend datasets: add rows to existing datasets from files, assuming the same column structure/order. The inability to update imported data is a limiting factor (user survey) | Related issue: #715
72 | User | Making joining records easier across 2 datasets by multiple keys! | Need to create an issue
72 | User | Easier rename a column by clicking on it and don't break any facets that depend on it | Loosely related issue: #6282
68 | Seed | Loading and working very large projects more easily/smoothly (100,000s of rows/records). Loading and working large datasets could increase OR's user base in both numbers and variety to a great extent in my opinion. | Related to the scaling - 4.x branch
67 | User | Allow canceling at any time those long spinning operations like Clustering | Clustering will have many improvements in the upcoming 3.9 version. Need to create an issue to cancel operations that take a long time to compute.
67 | Seed | Better support of nested structure (improved record mode) | See discussion in Representing hierarchical data: beyond the records mode?
67 | User | More GREL options that allow for creating and using variables for your dataset | Need to create an issue
65 | Seed | Option to save facet | Related issue: #560
62 | Seed | Support Python 3 as an expression language - Continue maintaining Python / Jython option. | Also requested in the survey's open-ended questions. Related issue: #2249
61 | Seed | Drag and drop for columns | Issue to be created
60 | Seed | Native Reconciliation against a SPARQL query | This is supported via RDF extensions but may not be known to all users. How can we better advertise it?
59 | Seed | Support for more diverse (human language) alphabets/scripts, date and time formats... | No related issue. This question comes from the Diversity grant.
58 | User | Allow bookmarking and naming starred GREL expressions so they can show in a Star top-level menu. (seed) / Allow Users to customize a Custom Menu to save macro (user suggestion) / More 'point and click' functions to replace GREL (seed) | The consensus from the discussions is to promote an option for users to create macros. I merged the two seeded questions (individual scores of 60 and 54) and the user suggestion (score of 64) for an average score of 59. This is loosely related to the suggestion of allowing the sharing and exploring of public expressions. Related issue: #109
58 | User | Quick delete all rows having empty cells | Related issue: #1472
58 | Seed | In context help for GREL or wizard-like approach to writing GREL (like Excel) | Mentioned in User Interviews Results Part 2: Exploring Feedback Regarding OpenRefine Feature and User Experience. This issue would require additional design work to better scope and refine the feature.
58 | Seed | More and better notifications, error messages, and warnings in OpenRefine | This is tracked via many issues (and most likely many more to create) under the error handling tag in GitHub
58 | Seed | Multi-user support: allowing two or more people to work on the same project | Related issue: #101 and discussion at the 2024 Barcamp OpenRefine 2024 Barcamp: OpenRefine as a Service
53 | Seed | Some simple data visualization features | Related issue: #5315
53 | User | Don't go back to the beginning after matching during reconciliation | Related issue: #33 and #6546. The improvement is part of the upcoming release of version 3.9. This suggestion was introduced later in the survey and therefore received fewer votes than others.
53 | Seed | Option to refactor the JSON operation scripts to edit a facet, update a GREL command, or add a step | Related discussion Which reproducibility should we focus on? - #5 by Martin
51 | Seed | Improved integration with cloud storage services for data import and export. | Need to create an issue and better scope this feature.
50 | User | Supported REST API for external use | This is already supported via "Add column by fetching URLs". During the 2024 BarCamp we discussed supporting OpenAPI within OpenRefine OpenRefine 2024 Barcamp: Support OpenAPI in OpenRefine
49 | Seed | Improved JSON parsing when calling API | Related issues: #1440 #2515
49 | User | Faster rendering of many columns in Record mode |
48 | User | Integrate a call to HuggingFace AI models to automate tasks (see HuggingSheet) | @Michael_Markert shows how to integrate OpenRefine with an LLM here: Using local ChatGPT-like LLMs in OpenRefine for data wrangling. This suggestion was introduced later in the survey and therefore received fewer votes than others.
47 | Seed | Make OpenRefine easier to learn and get started with better or easier UX / interface | This would need dedicated design effort
46 | Seed | Pause and resume my operations in OpenRefine | Related discussion: Partial results of long-running operations
46 | Seed | Save Template exports | Related issues: #1928 #468
45 | Seed | Allow users to set precise values for numeric facets | Related issues: #5168 #5008
44 | Seed | Less abandoned OpenRefine extensions: only present maintained and currently operational ones | A participant indicated in the open-ended question of the survey that many plugins and services are VERY DATED and look abandoned. This needs a refresh. See also conversation in Improving the UX of extension install, and Butterfly
44 | Seed | Better support of MARC format for complex dimensions and repeating elements. | Related issues: #794 #2127
44 | Seed | A walkthrough tutorial inside the software itself, to introduce and guide new users | This would need dedicated design effort
43 | User | Ability to extend data by bringing in qualifiers from Wikidata | This suggestion was introduced later in the survey and therefore received fewer votes than others.
41 | User | Supported client/client library based on REST API |
40 | User | AI integrated help for writing regular expressions, GREL etc | May be related to the suggestion In context help for GREL or wizard-like approach to writing GREL (like Excel)
38 | User | Default Wikimedia support as a core OpenRefine feature |
37 | User | Add transform for book-style Title casing | Need to create an issue
37 | User | Allow sharing and exploring of public expressions | This is discussed in Which reproducibility should we focus on? This is loosely related to #109
35 | Seed | Faster upload to Wikibase, Wikidata, or other Wikimedia projects – Fully maintained production Wikidata reconciliation service with better reconciliation and performance | See the summary of the discussion during the 2024 Barcamp OpenRefine 2024 Barcamp:: Reconciliation in OpenRefine
34 | Seed | An online, hosted instance of OpenRefine | This is often requested by trainers as a replacement for the unstable mybinder deployment. See also the summary of the discussion during the 2024 Barcamp OpenRefine 2024 Barcamp: OpenRefine as a Service
34 | Seed | A keyboard-accessible GUI | This is supported via the keyboard acceleration extension prototype. See discussion Keyboard acceleration extension prototype and repo
32 | User | Work on reconciliation of Wikidata Lexemes | Related issue: #2240 and forum discussion OpenRefine support for Lexemes in Wikidata: how would you use this?. This suggestion was introduced later in the survey and therefore received fewer votes than others.
32 | Seed | Delete multiple projects at once | Related issue: #4965
29 | User | Parquet import/export | Related issue: #1929
26 | User | Easy start/stop of OpenRefine on Windows | Related issue: #3221
24 | User | Better support for SELF HOSTED wikidata instances. (setting up manifests, and creating data previews (when reconciling) is full of dark secrets). Wikibase cloud reconciliation - Improved integration with Wikibase would be important to me because right now I have to make do with some workarounds that can be time-consuming | I suppose this suggestion relates to the Wikibase instance and not OpenRefine itself. See this thread regarding the effort to make this process easier: Fundraising to commission the development of a MediaWiki extension for reconciliation with Wikibase. This suggestion was introduced later in the survey and therefore received fewer votes than others.
20 | User | Support HDF5 importer and selecting a file within it | Related issue: #640
20 | Seed | Support R as an expression language | Related issue: #1226

Additional Feature Requests

These are suggestions gathered from the open-ended questions in the user survey. They were not part of the options available for voting but provide valuable insights into user needs and potential improvements.

  • Dark mode would be greatly appreciated. Related issue: #3017
  • An official docker hub image would be nice. This is available in GitHub - OpenRefine/containers: Collection of containerized packages of OpenRefine; see discussion Proposal for a new repository: containerizations for OpenRefine
  • Any features about geographic coordinates will be very useful. Related issue: #6570 and forum discussion OpenRefine 2024 Barcamp: Making OpenRefine more useful as exploratory tool
  • Connect with Zotero for reconciliation and publication. Issue to be created.
  • ODS spreadsheets fail to upload. Related issues: #6877, #3055, #2243
  • Adjust columns. Related issue: #4806
  • Increase the size of the preview window when we are working on the column. I work with really long values, and sometimes, I can't even see one full value in the preview.
  • I really wish there was a settings page in the GUI that would highlight what could be modified and what the current configuration files are, even if they can't be modified from the menu. This would expose what could be customized by the user in the config and provide a guide on how to extend or what settings could be modified. In particular, this could be used to highlight new defaults and new reconciliation services and extensions, both of which are only really visible if you dive deep into the help or are working in one of those areas. The ability to discover that they exist at all, if they changed, or their current status (i.e. broken, working, slowed) etc. would benefit from a settings menu in the OpenRefine GUI for cross-discovery and lowering the entry point to OpenRefine's more useful integrations.

Thanks Martin! That will take a while to process, but a few quick questions off the top of my head:

  • will the raw vote tallies themselves be released? There's a pretty big range in the 1-208 vote span (those that received a special comment) which could significantly affect score interpretation. It would also be interesting to look at different visualizations for votes per user and votes per day.
  • since the "mean" is exactly 50, it's presumably normalized in some way. Can you describe the normalization?
  • is the score of wins/total done using the total votes for a given entry or the total votes overall? Hopefully the former so they're not biased by the number of times an entry was presented for voting.
  • the 1280 vote user (and perhaps the 640 & 641 vote users) seem like significant outliers. Did you do any further analysis to see if there's anything suspicious going on there? Or sensitivity analysis to see what results look like with that/those voter(s) excluded?
  • was the phrase "Loading and working large datasets could increase OR's user base in both numbers and variety to a great extent in my opinion." actually included in the question? Hopefully not!

Tom

@tfmorris here is the raw data: choice_votes_fc4d421d-b2f3-4f8a-aaa6-9d56194df9b1.xlsx (1.1 MB). The first tab includes the cleanup I did (merging questions and adjusting scores).

Let me clarify how the scoring is done.

  • All Our Ideas presents a pair of statements to the participant.
  • The participant votes for one or the other (or suggests something else). This is what All Our Ideas calls one vote. A vote can be a win or a loss.
  • On average each statement received 350 votes.
  • The score (presented in my analysis) represents how many times the statement won the vote vs the number of votes.
  • The mean of 50 indicates that a statement won as many votes as it lost.
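To make the steps above concrete, here is a minimal sketch of tallying pairwise votes into scores. The ballots and statement names are hypothetical, and this follows the wins-over-votes description in this thread rather than whatever All Our Ideas computes internally. It also shows why the average lands on 50: every vote adds one win and one loss, so the vote-weighted mean of the scores is exactly 50.

```python
from collections import Counter

# Hypothetical ballots as (winner, loser) pairs.
ballots = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("A", "B")]


def tally(ballots):
    """Return per-statement scores (percent of votes won) and vote counts."""
    wins, votes = Counter(), Counter()
    for winner, loser in ballots:
        wins[winner] += 1
        votes[winner] += 1  # a vote counts for both statements shown
        votes[loser] += 1
    scores = {s: 100 * wins[s] / votes[s] for s in votes}
    return scores, votes


scores, votes = tally(ballots)

# Weight each score by how many votes the statement took part in.
weighted_mean = sum(scores[s] * votes[s] for s in scores) / sum(votes.values())
print(scores, weighted_mean)
```

The plain (unweighted) mean of the scores can drift slightly from 50 when statements receive different numbers of votes.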

I didn't investigate the person who recorded 1280 votes.

A user shared the statement, "Loading and working large datasets could increase OR's user base in both numbers and variety to a great extent, in my opinion." I merged it with the statement, "Loading and working very large projects more easily/smoothly (100,000s of rows/records)."

Thanks for the quick reply! Since the top feature request got 208 votes (vs a mean of 350) and was a 3:1 winner, I think we can probably remove the asterisk next to it.

I'll see if I find anything odd about the outlying voters. Since the top 3 voters represent over 25% of the votes cast, it seems worth a little investigation.
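For reference, a rough sketch of that check: the share held by the three most active voters (figures from this thread), plus one way to recompute scores with selected voters excluded. The ballot layout `(voter, winner, loser)`, the voter IDs and the `scores` helper are hypothetical; the real export may be structured differently.

```python
from collections import Counter

# Vote counts of the three most active voters, as quoted in this thread.
top_three = [1280, 641, 640]
total_votes = 9_077
share = sum(top_three) / total_votes
print(f"top three voters: {share:.1%} of all votes")


def scores(ballots, excluded=frozenset()):
    """Percent of votes won per statement, skipping excluded voters."""
    wins, seen = Counter(), Counter()
    for voter, winner, loser in ballots:
        if voter in excluded:
            continue
        wins[winner] += 1
        seen[winner] += 1
        seen[loser] += 1
    return {s: round(100 * wins[s] / seen[s]) for s in seen}


# Hypothetical ballots: voter v1 is very active and always prefers A.
ballots = [
    ("v1", "A", "B"), ("v1", "A", "B"), ("v1", "A", "B"),
    ("v2", "B", "A"), ("v3", "B", "A"),
]
all_scores = scores(ballots)
without_v1 = scores(ballots, excluded={"v1"})
print(all_scores, without_v1)
```

If the ranking barely changes when the heaviest voters are dropped, the outliers are probably not a concern.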

Tom

p.s. of course, the average score is 50! I wasn't thinking clearly when I asked that question.

I'm happy to report that I spent about 2 hours during an afternoon, clicking thru the AllOurIdeas survey. So maybe the person who recorded 1280 votes was me :) maybe not.
