Exposing datatype statistics in columns

Column headers currently display some limited statistics about the values they contain: if the column is reconciled, a little bar chart shows the proportion of cells which are matched, marked as new or unmatched:

image

Other than that, columns are currently not associated with a particular datatype. To get an overview of which sorts of datatype a column contains, the user has to create a facet for it.

Various other ETL tools attempt to attribute datatypes to columns, based on the most common datatype in the cells contained in that column. This is useful to export the data to some formats, such as SQL or Data Package which embeds column-level metadata.

As part of the migration to the 4.0 architecture, I am in the process of changing the way recon statistics are computed: currently, the statistics are re-computed every time an operation is run, and I am delaying that to happen only when facets are computed. This is more efficient because in one single pass over the grid, we can compute the statistics of all facets as well as the recon statistics for all columns. I have used this opportunity to not just compute recon statistics, but also datatype statistics, so that we know how many times each supported datatype appears in each column.

The question is then: how do we want to display this information to the user?
Some random ideas on my side: we could decide to expand the existing histogram, with more colours for each datatype. I am not really sure this would be very usable since people would need to learn the colour scheme (which would have to be compatible with colour blindness). Also, if a column contains only one datatype (excluding null cells) then it would probably be nicer to just display the name of the datatype, rather than a bar full of the colour representing that datatype. The exact breakdown of datatypes could potentially be accessible after hovering or clicking - I can imagine it’s best to avoid overloading the column header with information that is shown all the time.

Let me know what you would find best! There is no need to figure it out super quick, for now I can just ignore the datatype information and just stick to the existing rendering of the recon stats.

I would prefer a Click for more detailed stats breakdown of a column…but in other tools, I often use their stat views across a few columns at a time and not just individually (it helps with quickly seeing overall outliers (<75 of all 3 columns are Strings, not Numbers)…so…maybe another sub-header div for show/hide datatype stats? within each column sub-header, it could be flexibly changed by user with a tiny toggle button for displaying the stats as:

  1. a bar
  2. enum list of the types?

Later, a data quality extension might add, overwrite, or extend the column sub-header to show data quality metrics or histograms. For example, in Trifacta’s data quality bar

Color bar Description
green Values that are valid for the current data type of the column
red Values that are mismatched for the column data type
gray Missing or null values

In short, beginning to allow a sub-header extension area for mini-facets might work well in the future. We might even consider moving some often used facets from the left-panel that ask a basic question into the column sub-header for datatype filtering such as isBlank, isNull, isType.

I think this has been discussed before e.g. Add new data quality or data profiling UI enhancements · Issue #5315 · OpenRefine/OpenRefine · GitHub and Defining a cell type for a validation mechanism for columns · Issue #2474 · OpenRefine/OpenRefine · GitHub

data type is really only one thing that could be measured for the validity of data in a column. I’d much prefer to see a development that was aimed at supporting a more comprehensive concept of checking column content. So for example not just that everything in the column is a string, but that all the strings match a particular regular expression (or some other rule)

With this in mind I think any UI design here needs to consider the possibility that there maybe multiple aspects to checking a column and reporting back to the user

That would fit very well in this new system too. I guess it's "just" about defining what sort of class of checks we should support, how they should be presented to the user, whether they should be extensible. Perhaps it would be a good topic for a design session in an event like the BarCamp I plan to organize.

If we are going for something more ambitious, then for now I will stick to reproducing the current behaviour (only show recon stats).

1 Like

I agree with Owen that this should be generalizable and I think one could do worse than use the Trifacta design as a starting point. There’s obviously a trade-off between information density and complexity / visual clutter. They’ve gone with two visualizations per column. They use a bar graph exactly like the reconciliation bar graph, except that they use it for Data Quality (by default?), plus a value histogram similar to OpenRefine’s numerical facet display, but for each column. Another useful display strategy, for simple numerical data, might be something similar to what Excel and other spreadsheets use to display min/max/mean, etc.