OpenRefine 2024 Barcamp: Making OpenRefine more useful as an exploratory tool

In this session (see Barcamp page), we discussed how crucial data visualization is for research. When you get new data, evaluating its quality and creating visual representations is important. However, doing this in OpenRefine is difficult because the interface lacks good visualization features, and the facets are too small. Currently, the process involves:

  1. Cleaning the data in OpenRefine.
  2. Exporting the data to create visualizations in R or Excel.
  3. Identifying any outliers.
  4. Going back to OpenRefine to address any issues that were identified.
  5. Repeating the process if further inconsistencies are discovered.

Data integrity is a critical aspect, but this session focused on enhancing visualization capabilities in OpenRefine. We also recognized a strong interest in improving data integrity validation.

List of Visualization Improvements Identified

Overall Improvements

Several participants wished that a facet could be popped out into its own window to allow for larger visuals.

@mack-v suggested it would be beneficial to generate visualizations for reporting, saving, and sharing. The ability to save visualizations would be especially useful as OpenRefine projects are not easily shared. This feature could potentially be developed as a plugin.

Facet History

@jfaurel, @thadguidry, and @ostephens indicated that closing a facet accidentally results in losing it, which is problematic, especially with custom GREL expressions. It would be nice to see the history of created facets.

Data Type Enhancements

Associating a data type to a column or other constraints and being notified of inconsistencies would eliminate the need to create facets for this purpose. @thadguidry indicated this could be implemented using coloring or indicators for data type outliers. See the conversation Exposing datatype statistics in columns.
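To make the idea concrete, here is a minimal sketch of what a column-level type check could look like. It is not OpenRefine code: the function name `find_type_outliers`, the type labels, and the sample column are all hypothetical, and it assumes cells arrive as strings.

```python
def find_type_outliers(column, expected_type):
    """Return (row_index, value) pairs for cells that do not parse as expected_type.

    Hypothetical helper for illustration only; blank cells are skipped
    rather than reported.
    """
    parsers = {"number": float, "integer": int}
    parse = parsers[expected_type]
    outliers = []
    for i, cell in enumerate(column):
        if cell is None or cell == "":
            continue  # treat blanks as missing, not as type errors
        try:
            parse(cell)
        except (TypeError, ValueError):
            outliers.append((i, cell))
    return outliers


# Made-up sample column: heights that should all be numbers.
heights = ["1.72", "1.80", "n/a", "", "165cm", "1.65"]
flagged = find_type_outliers(heights, "number")
print(flagged)
```

The flagged row indices are what a coloring or indicator in the column header could be driven by, without the user having to build a facet first.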

Basic Stats

@ostephens suggested better statistics on numeric values could be provided as discussed in #2001. The refine-stats extension is incompatible with the latest OpenRefine release and limits results to a pop-up instead of a facet.

Additional resources:

Histogram/Timeline Improvements

Limitations of the number and timeline facets include:

  • Uncontrollable binning
  • Ineffective graphs and sliders due to binning issues
  • Outliers significantly affect binning, making facets useless for "normal" ranges

Most participants said that better control over bin boundaries is needed. See #5168 and #3248.
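The effect of a single outlier on binning can be illustrated with a small sketch. This is not how OpenRefine computes its facets; `equal_width_bins` and the sample values are assumptions made for illustration, using plain equal-width bins with optional user-set boundaries.

```python
def equal_width_bins(values, n_bins, lower=None, upper=None):
    """Count values in n_bins equal-width bins between lower and upper.

    When lower/upper are omitted, the data's min/max are used.
    Returns (counts, below, above), where below/above count the values
    that fall outside the chosen boundaries.
    """
    lo = min(values) if lower is None else lower
    hi = max(values) if upper is None else upper
    if hi <= lo:
        raise ValueError("upper boundary must be greater than lower boundary")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    below = above = 0
    for v in values:
        if v < lo:
            below += 1
        elif v > hi:
            above += 1
        else:
            # Clamp so that v == hi lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts, below, above


# Made-up ages with one data-entry outlier (999).
ages = [21, 25, 30, 34, 38, 42, 47, 55, 63, 999]

auto = equal_width_bins(ages, 5)                         # boundaries from min/max
bounded = equal_width_bins(ages, 5, lower=20, upper=70)  # user-set boundaries
print(auto)     # ([9, 0, 0, 0, 1], 0, 0)
print(bounded)  # ([2, 3, 2, 1, 1], 0, 1)
```

With automatic boundaries, nine of the ten values collapse into the first bin and the histogram says nothing about the "normal" range. With explicit boundaries, the distribution becomes visible and the outlier is reported separately.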

Enhancing Scatterplot Usability

@jfaurel and @ostephens demonstrated how the scatterplot can function as an exploratory tool, but it is not very usable. Enhancements could include making it larger, allowing data drill-down, and starting from a blank canvas with selectable variables. Currently, selecting the right pair of columns is difficult when the project has numerous numerical columns.

Map Facet

@jfaurel reaffirmed that OpenRefine's scope is to clean datasets, not generate maps. It should identify outliers rather than produce publishable maps. Users currently use scatterplots masquerading as maps, which is not optimal. A proper map facet would make it possible to quickly visualize the distribution of longitude/latitude coordinates and identify outliers.
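As a rough illustration of the kind of outlier a map facet would surface, the sketch below flags coordinates that cannot be valid. It is only a range check, not a map facet, and the function name `flag_coordinate_outliers` and the sample points are hypothetical.

```python
def flag_coordinate_outliers(points):
    """Flag (lat, lon) pairs that are out of range or sit exactly at (0, 0).

    Hypothetical helper for illustration. Returns (row_index, reason) pairs.
    """
    flagged = []
    for i, (lat, lon) in enumerate(points):
        if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
            # Often a sign that latitude and longitude were swapped.
            flagged.append((i, "out of range"))
        elif lat == 0 and lon == 0:
            # (0, 0) is a common placeholder for a missing location.
            flagged.append((i, "null island"))
    return flagged


# Made-up sample: row 1 has lat/lon swapped, row 2 is a placeholder.
points = [(40.7, -74.0), (-122.4, 37.8), (0, 0), (51.5, -0.1)]
suspect = flag_coordinate_outliers(points)
print(suspect)
```

A check like this misses swapped pairs that happen to stay within range, which is exactly where plotting the points on a map helps.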

@HannaMeiners suggested interactions with OpenStreetMap, possibly using tools like Leaflet. @martin asked about improving OpenRefine's relationship with the OSM community.

Additional resource: Georefine extension.


During the conversation, we also discussed potential solutions for implementation.
@ostephens asked which backend and frontend tools could support this type of experience.

@thadguidry mentioned that visualization, especially for geo-mapping, should live outside OpenRefine (or as an extension). Tools like Apache Superset could be used for analysis if we opened a data API for the data grid. Filtering and narrowing fall under OpenRefine's scope.

Mermaid was suggested as another option. Mermaid is a markdown-inspired diagramming tool, completely browser-based, making integration with web-based software easy. It could be used for rendering and customizing advanced diagrams in OpenRefine and for exporting diagrams for reports or documentation.

Several contributors discussed the option of opening up an OpenRefine API so that other tools can access its data:

  • @Michael_Markert: Exposing data in OpenRefine via an API to directly access the entire "dataframe" or individual columns would eliminate the need for saving, exporting, and re-importing for intermediary steps. See #1226 for discussion about exposing the entire grid to external languages (e.g., R).
  • @thadguidry: This idea was discussed early (in 2010) to have a database within OpenRefine's backend but was postponed due to other priorities.