Recipe visualization prototype

As promised, here is a post to present the work I have been doing lately on recipe visualization.

The goal of this work is to offer a visual representation of a list of operations. The hope is that it would let users get a better overview of what the recipe does. The use cases I have in mind are:

  • when looking at the history of a project, get a more understandable and compact view of the operations done, so that it's easier to select which step to go back to with the undo functionality;
  • when creating a recipe in the "Extract" dialog, make it easier to select which operations should be part of the recipe, by displaying the inter-dependencies between the selected operations;
  • when applying a recipe in the "Apply" dialog, understand the role of the input and output columns of the recipe, by showing how they are used inside the recipe;
  • ease the auditing of data cleaning processes (instead of reading long JSON payloads), for instance in scientific articles, reviewing of Wikidata bots based on OpenRefine, or other external use cases.

However, after a few user testing sessions, it's clear that making this really usable is quite some work, and I don't know if it should be a priority compared to other enhancements (see the end of this post). That's why I want to solicit your feedback!

Current state of the functionality

Here is a quick preview of what the visualizations currently look like:

In the screenshot above, each operation is represented by a node (with the icon corresponding to the menu entry it was triggered from), and one can read the full description of the operation by hovering over it. Each column involved in the recipe is drawn as a vertical line. The operation is placed on the column it creates or modifies. Columns which are read by an operation without being modified are marked with a "dot" on the column (as in the last operation of the recipe, which saves a Wikibase schema involving four columns).
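To make the drawing rules concrete, here is a minimal sketch in Python of how such a layout could be computed. It is not OpenRefine's actual code: the `Operation` class, its `reads` and `writes` fields and the `layout` function are all hypothetical names chosen for illustration, and leaving the node position empty for an operation that modifies no column is my own simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Operation:
    """Hypothetical, simplified view of one recipe step."""
    description: str                                 # shown when hovering over the node
    writes: Optional[str] = None                     # column created or modified, if any
    reads: List[str] = field(default_factory=list)   # columns read


def layout(operations):
    """Give each column an x position and each operation a row (y position)."""
    columns = []  # columns in order of first appearance in the recipe
    for op in operations:
        touched = list(op.reads)
        if op.writes is not None:
            touched.append(op.writes)
        for name in touched:
            if name not in columns:
                columns.append(name)
    x = {name: i for i, name in enumerate(columns)}

    rows = []
    for y, op in enumerate(operations):
        rows.append({
            "y": y,
            # the node sits on the column the operation creates or modifies
            "node_x": None if op.writes is None else x[op.writes],
            # columns that are only read get a "dot"
            "dot_xs": sorted(x[n] for n in op.reads if n != op.writes),
            "tooltip": op.description,
        })
    return columns, rows


# Made-up three-step recipe, only for illustration
recipe = [
    Operation("Trim whitespace in column name", writes="name"),
    Operation("Create column label based on column name", writes="label", reads=["name"]),
    Operation("Save schema", reads=["name", "label"]),
]
columns, rows = layout(recipe)
```

Each entry of `rows` then carries enough information to draw one line of the diagram: one node, zero or more dots, and the hover text.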

Some operations don't expose which columns they modify or create, because the exact set of such columns depends on the data they are executed on. Such operations are currently displayed as "boxes" spanning the entire width of the recipe (as they can potentially interact with all columns). This is the case of the operation to split a column into multiple columns, in the following example:

Most of those opaque operations, like the "reorder-columns" operation, don't generalize well. One hope I have is that users would learn to avoid such opaque operations when crafting reusable recipes, and that visualizations would help them spot those operations quickly.
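As a rough illustration of what "identifying opaque operations quickly" could mean outside of the drawing, the sketch below flags the steps of a recipe that don't declare which columns they touch. The dictionary layout (`description`, `reads`, `writes`, with `None` meaning "unknown") is an assumption made for this example and is not the JSON format OpenRefine uses.

```python
def find_opaque_steps(recipe):
    """Return (index, description) for each step with an unknown column footprint."""
    return [
        (i, step["description"])
        for i, step in enumerate(recipe)
        # None means the step cannot say which columns it reads or writes
        if step.get("reads") is None or step.get("writes") is None
    ]


# Made-up recipe, only for illustration
recipe = [
    {"description": "Text transform on column name", "reads": ["name"], "writes": ["name"]},
    {"description": "Split column address by separator", "reads": None, "writes": None},
    {"description": "Reorder columns", "reads": None, "writes": None},
]
opaque = find_opaque_steps(recipe)
```

Here `opaque` lists the split and reorder steps, which are the ones that would be drawn as full-width boxes.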

One other hope I have for this feature would be to help keep track of the provenance of information in columns. It is not uncommon to have multiple columns containing the same kind of information, coming from different sources, so that we can compare those fields as part of the data cleaning process. For instance, when importing data to Wikidata, I would often fetch the existing data on Wikidata to compare it to the external data, so that I can analyze inconsistencies. This often leads to having columns with similar names, which are easy to mix up. This type of visualization could help identify that, by getting a quick overview of how a column is used throughout the recipe. The following example (taken from this Wikidata tutorial) gives an idea of this sort of provenance analysis, by showing how many columns are derived from the "orcid" column after fetching URLs from it:
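Independently of the drawing, this provenance question can be phrased as a small computation: starting from one column, follow the operations in order and collect every column that was produced from it, directly or indirectly. The sketch below assumes a simplified recipe made of `(reads, writes)` pairs; the function name and the column names are made up and are not the ones from the tutorial.

```python
def derived_columns(recipe, source):
    """Columns whose content depends, directly or indirectly, on `source`.

    `recipe` is a list of (reads, writes) pairs in execution order, where
    `reads` is a list of column names and `writes` is the single column
    the step creates or modifies.
    """
    tainted = {source}
    for reads, writes in recipe:
        # a step that reads any tainted column taints the column it writes
        if tainted & set(reads):
            tainted.add(writes)
    return tainted - {source}


# Made-up recipe, only for illustration
recipe = [
    (["orcid"], "orcid_json"),          # fetch URLs built from the identifier
    (["orcid_json"], "family_name"),    # parse a field out of the response
    (["orcid_json"], "given_name"),
    (["wikidata_id"], "wikidata_name"), # unrelated to the orcid column
]
derived = derived_columns(recipe, "orcid")
```

One pass is enough because steps run in order. The sketch deliberately ignores the case where a later step overwrites a derived column with unrelated content.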

Needs identified during testing

  • Various tweaks to the visual choices made to generate the visualizations, to make them more intuitive: for instance, moving columns is currently very unintuitive, and so are the "dots" corresponding to columns read. The layout dimensions could of course also be adjusted (to make the icons bigger and more readable).
  • Facets applied to an operation are currently displayed as additional columns read by the operation, but it would be better to make it possible to see what facet was applied.
  • Users don't feel "in control" of the recipe: they cannot interact with it as much as when it is represented in JSON. The ability to make some changes to the recipe directly from the visualization view would be helpful, such as deleting certain operations, or reordering steps.
  • Overall, it would make sense to always offer the visualization as an alternative to the existing JSON representation, as users are used to it and shouldn't be forced out of it.

Trying it out

You can check out the work branch of my fork and run OpenRefine from there to give it a spin. It's currently only integrated in the "Apply" and "Extract" dialogs - it's an open question if and how to integrate it in the undo/redo tab itself.

What future for this prototype?

As I wrote above, it's not clear to me whether it's worth investing more effort into this. I never expected the representation to be intuitive to everyone at first glance. I think it could still be useful after some getting used to, but it's also clear that it's a big jump and my time could be better spent on other improvements. This representation just builds on top of the infrastructure I built to enable the mapping of columns when applying a recipe (and that feature seems more useful and easier to roll out), so it may well be that the visual representation was mostly useful for me (as a mental representation) when building that mapping feature, and isn't really meant to be shown to end users.

The other tasks I have on my mind are:

  • Improving the generalizability of operations (reducing the number of opaque operations), which will have an impact on the column mapping UI (doing a better job at exposing the correct columns to be mapped)
  • Offering a library of recipes inside OpenRefine, similarly to favorited GREL/Jython expressions? Potentially letting users integrate some recipes in the UI for better accessibility (for recipes used often)
  • Improving the operation icons and rolling them out in more places (undo/redo panel)

Let me know what you think!

Wow Antonin, this is really incredible!

This is just great! I really love it. I’ll need more time to understand all the implications of this design, but it’s really something that will make OR a much better tool.

This is quite impressive.

Good work, I’ll come back once I’ve digested it all. But meanwhile, kudos to you!

Best regards, Antoine

This is great! The overwhelming feeling comes from all the possibilities this new interface offers. Before expanding the scope or getting further into the design, should we merge and benefit from what you have already developed?

I’d suggest breaking down what you have so far into individual PRs. For example, I am particularly interested in:

  • Checking if the columns are present when applying a JSON transformation. The interface to remap is useful and can benefit many users.
  • Providing a visualization of the steps alongside the JSON export in a new tab as a beta or preview feature could gently introduce it to users without disrupting their existing workflow. This would also allow us to gather more feedback.
  • Alternatively, adding a Visualize button under the undo/redo tab could bring up the visualization in full screen. This would require a dedicated modal since the image can be wide for projects with many columns.
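To illustrate the first point, here is a minimal sketch of the kind of check meant, written in Python with made-up names (`missing_columns`, `required`, `mapping`); it is not taken from the prototype. It reports which columns a recipe needs but the project lacks, optionally after the user has remapped some of them.

```python
def missing_columns(required, project_columns, mapping=None):
    """Return the recipe columns still absent from the project after remapping.

    `mapping` maps a column name used in the recipe to the name of the
    project column that should play its role.
    """
    mapping = mapping or {}
    available = set(project_columns)
    return [name for name in required if mapping.get(name, name) not in available]


# Made-up example: the recipe expects "orcid", the project calls it "ORCID iD"
required = ["orcid", "name"]
project_columns = ["ORCID iD", "name"]

before = missing_columns(required, project_columns)
after = missing_columns(required, project_columns, {"orcid": "ORCID iD"})
```

An empty result means the recipe can be applied; a non-empty one is what a remapping interface would ask the user to resolve.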

I am also curious about the prototype status regarding scenarios 0A and 0B defined in Which reproducibility should we focus on? - #3 by tfmorris

I think it relates to

The feature ranked in the top 30% of features in the Results from the Feature Prioritization Survey 2024. While I’m not a developer, this seems like a large project that might be best handled separately.

Yes, please! I initially expected to see the icon under the undo/redo. During my test, I found the icon to be super useful. Within ten minutes, my eyes were instinctively searching for the icon that matched the operation I wanted to perform, rather than reading the text.

This feature relies on the improvements to the backend that I have been making for the past few months (such as Introduce the Recipe class to hold a list of operations by wetneb · Pull Request #7116 · OpenRefine/OpenRefine · GitHub).
I have been waiting a few weeks for each of those pull requests before merging them without review, given that I am essentially the only active Java reviewer at the moment. By default, I would just continue doing that until I reach the commits which implement the visualization. I am happy to adjust the waiting time for each PR, change the way I structure the PRs, or make other adjustments.

I’d suggest breaking down what you have so far into individual PR.

They are already broken down into individual commits on my work branch; I just can't submit them yet because they are blocked by dependencies on changes to the backend.

My changes don't address those use cases as there still isn't a way to replay a whole workflow, including the import stage.
Changing the Extract dialog to include the import metadata isn't hard, but then there isn't a clear way to let the user reapply it. It's a feature that should likely be integrated in the home screen (since it's meant to be used without having created a project yet), in a way that still makes it possible to select the new dataset to run the workflow on.
So it's something that would require a larger design effort.

An alternative would be the creation of a CLI utility which would be able to run entire workflows by specifying the input file, import settings, operations and export settings in one go (in the spirit of orcli). My understanding is that it's something @tfmorris opposes as being outside of OpenRefine's scope.

I am OK with merging as long as tests pass and there's no regression in any OpenRefine feature - I'll be sure to help test this.

My only ask is that I'd like to see the Recipe visualization as unobtrusive as possible for existing users. So a small button somewhere, as was suggested by @Martin, and I'd make that button look something like this (with hovertext "Visualize Recipe"):
