As promised, here is a post to present the work I have been doing lately on recipe visualization.
The goal of this work is to offer a visual representation of a list of operations. The hope is that it would let users get a better overview of what the recipe does. The use cases I have in mind are:
- when looking at the history of a project, get a more understandable and compact view of the operations done, so that it's easier to select which step to go back to with the undo functionality;
- when creating a recipe in the "Extract" dialog, make it easier to select which operations should be part of the recipe, by displaying the inter-dependencies between the selected operations;
- when applying a recipe in the "Apply" dialog, understand the role of the input and output columns of the recipe, by showing how they are used inside the recipe;
- ease the auditing of data cleaning processes (instead of reading long JSON payloads), for instance in scientific articles, reviewing of Wikidata bots based on OpenRefine, or other external use cases.
However, after a few user testing sessions, it's clear that making this really usable is quite some work, and I don't know if it should be a priority compared to other enhancements (see the end of this post). That's why I want to solicit your feedback!
Current state of the functionality
Here is a quick preview of what the visualizations currently look like:
In the screenshot above, each operation is represented by a node (with the icon corresponding to the menu entry it was triggered from), and one can read the full description of the operation by hovering over it. Each column involved in the recipe is drawn as a vertical line. The operation is placed on the column it creates or modifies. Columns which are read by an operation without being modified are marked with a "dot" on the column (as in the last operation of the recipe, which saves a Wikibase schema involving four columns).
Some operations don't expose which columns they modify or create, because the exact set of such columns depends on the data they are executed on. Such operations are currently displayed as "boxes" spanning the entire width of the recipe (as they can potentially interact with all columns). This is the case of the operation to split a column into multiple columns, in the following example:
Most of those opaque operations don't generalize well, much like the "reorder-columns" operation. One hope I have is that users would learn to avoid such opaque operations when crafting reusable recipes, and that visualizations would help them spot these operations quickly.
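To make the drawing rules described above concrete, here is a minimal sketch in Python. It assumes each operation can be summarized by the columns it reads, the columns it creates or modifies, and an "opaque" flag. The `Operation` class, the `render_recipe` function and the column names are hypothetical and only serve the illustration; this is not the code from the prototype, which draws actual graphics rather than text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """Hypothetical summary of an operation (not OpenRefine's data model)."""
    description: str
    reads: tuple = ()      # columns read without being modified
    writes: tuple = ()     # columns created or modified
    opaque: bool = False   # True if the touched columns depend on the data


def render_recipe(operations):
    """Return (lanes, rows): one lane per column, one text row per operation."""
    # Give each column a lane, in order of first appearance in the recipe.
    lanes = []
    for op in operations:
        for name in op.reads + op.writes:
            if name not in lanes:
                lanes.append(name)

    rows = []
    for op in operations:
        if op.opaque:
            # Opaque operations may touch any column: draw a full-width box.
            cells = ["#"] * len(lanes)
        else:
            cells = []
            for name in lanes:
                if name in op.writes:
                    cells.append("O")   # node on the created/modified column
                elif name in op.reads:
                    cells.append(".")   # dot on a column that is only read
                else:
                    cells.append("|")   # column line passing through
        rows.append(" ".join(cells) + "  " + op.description)
    return lanes, rows


# Made-up recipe, for illustration only.
recipe = [
    Operation("Trim whitespace in name", reads=("name",), writes=("name",)),
    Operation("Create url from orcid", reads=("orcid",), writes=("url",)),
    Operation("Split name into several columns", opaque=True),
    Operation("Save Wikibase schema", reads=("name", "url")),
]

lanes, rows = render_recipe(recipe)
print("  ".join(lanes))
for row in rows:
    print(row)
```

One simplification to keep in mind: this sketch draws a line for a column even before the operation that creates it, which a real layout would avoid.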
One other hope I have for this feature would be to help keep track of the provenance of information in columns. It is not uncommon to have multiple columns containing the same kind of information, coming from different sources, so that we can compare those fields as part of the data cleaning process. For instance, when importing data to Wikidata, I would often fetch the existing data on Wikidata to compare it to the external data, so that I can analyze inconsistencies. This often leads to having columns with similar names, which are easy to mix up. This type of visualization could help identify that, by getting a quick overview of how a column is used throughout the recipe. The following example (taken from this Wikidata tutorial) gives an idea of this sort of provenance analysis, by showing how many columns are derived from the "orcid" column after fetching URLs from it:
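Independently of the drawing itself, this kind of provenance question can be sketched as a single pass over the recipe. The snippet below assumes every operation is reduced to a pair (columns read, columns written); the function name `derived_columns` and all column names other than "orcid" are made up for the illustration. It over-approximates: a column counts as derived as soon as one operation writing it reads a derived column.

```python
def derived_columns(operations, source):
    """Return the columns that depend, directly or not, on `source`.

    `operations` is an ordered list of (reads, writes) pairs.
    """
    derived = {source}
    for reads, writes in operations:
        # If the operation reads anything derived from the source,
        # everything it writes is considered derived as well.
        if derived & set(reads):
            derived |= set(writes)
    return derived - {source}


# Made-up recipe: (columns read, columns written) for each step.
recipe = [
    (["orcid"], ["orcid_url"]),          # build a URL from the identifier
    (["orcid_url"], ["orcid_json"]),     # fetch the URL
    (["orcid_json"], ["family_name"]),   # parse a field from the response
    (["orcid_json"], ["given_name"]),    # parse another field
    (["label"], ["label_clean"]),        # unrelated cleanup
]

from_orcid = derived_columns(recipe, "orcid")
print(sorted(from_orcid))
```

Opaque operations are left out here, since by definition they don't declare which columns they read or write; a conservative variant could treat every column as derived after such a step.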
Needs identified during testing
- Various tweaks to the visual choices made to generate the visualizations, to make them more intuitive: for instance, moving columns is currently very unintuitive, and so are the "dots" corresponding to columns read. The layout dimensions could of course also be adjusted (to make the icons bigger and more readable).
- Facets applied to an operation are currently displayed as additional columns read by the operation, but it would be better to show which facet was applied.
- Users don't feel "in control" of the recipe: they cannot interact with it as much as when it is represented in JSON. The ability to make some changes to the recipe directly from the visualization view would be helpful, such as deleting certain operations or reordering steps.
- Overall, it would make sense to always offer the visualization as an alternative to the existing JSON representation, rather than a replacement: users are used to the JSON and shouldn't be forced out of it.
Trying it out
You can check out the work
branch of my fork and run OpenRefine from there to give it a spin. It's currently only integrated in the "Apply" and "Extract" dialogs. Whether and how to integrate it in the undo/redo tab itself is an open question.
What future for this prototype?
As I wrote above, it's not clear to me whether it's worth investing more effort into this. I never expected the representation to be intuitive to everyone at first glance. I think it could still be useful after some getting used to, but it's also clear that it's a big jump and my time could be better spent on other improvements. This representation just builds on top of the infrastructure I built to enable the mapping of columns when applying a recipe (and that feature seems more useful and easier to roll out), so it may well be that the visual representation was mostly useful for me (as a mental representation) when building that mapping feature, and isn't really meant to be shown to end users.
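For readers wondering what kind of information the column mapping relies on, here is a rough sketch of how the input and output columns of a recipe could be worked out. It assumes each operation declares the columns it reads and writes, and that a column written without having been read earlier is a new column. The `recipe_interface` function and the column names are hypothetical; this is not the implementation in OpenRefine.

```python
def recipe_interface(operations):
    """Split the columns of a recipe into required inputs and produced outputs.

    `operations` is an ordered list of (reads, writes) pairs.
    """
    required, produced = [], []
    for reads, writes in operations:
        for name in reads:
            # Read before any step of the recipe created it:
            # the column must already exist in the project.
            if name not in produced and name not in required:
                required.append(name)
        for name in writes:
            # Written without being required so far: assumed to be new.
            if name not in produced and name not in required:
                produced.append(name)
    return required, produced


# Made-up recipe: (columns read, columns written) for each step.
recipe = [
    (["orcid"], ["orcid_url"]),       # new column based on an existing one
    (["orcid_url"], ["orcid_json"]),  # new column based on a produced one
    (["name"], ["name"]),             # in-place transformation
]

required, produced = recipe_interface(recipe)
print("columns to map:", required)
print("columns created:", produced)
```

Only the required columns would need to be mapped to columns of the target project. Opaque operations break this reasoning, which is one reason why reducing their number matters.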
The other tasks I have on my mind are:
- Improving the generalizability of operations (reducing the number of opaque operations), which will have an impact on the column mapping UI (doing a better job at exposing the correct columns to be mapped)
- Offering a library of recipes inside OpenRefine, similar to favorited GREL/Jython expressions, and perhaps letting users integrate some recipes in the UI for better accessibility (for recipes used often)
- Improving the operation icons and rolling them out in more places (undo/redo panel)
Let me know what you think!