Apply facets holistically to each row rather than to each record

I frequently find myself frustrated when I have been working in rows mode with many facets/filters and I then change to records mode and start seeing many more records than I was seeing before. I recognize that this is because facets and filters are applied to records without regard to which row they apply to- e.g. if I have a record with three rows and the value in column A in one of those rows is "a" and the value in column B for another of those rows is "b", then that record will show up if I have a text facet on column A selecting "a" and one on Column B selecting "b", even though none of the rows in the column would show up if I was in rows mode.

Proposed solution

At least to me, it would be more intuitive if facets selected for records that contain at least one row that meets all of the facets by itself rather than selecting for records that contain at least one row that meets each of the facets, even if that can only be achieved by a combination of rows. If facets only selected the records that contain row which show up in rows mode, that would ensure more consistency between the behavior of the two modes, and it would allow for what I suspect is a very common use-case, wherein a user is looking at a list of rows and wants to see the other rows in the same record as each of those rows, but is not interested in any other records. The unintuitive nature of the current behavior can also be seen by creating two facets and selecting mutually exclusive values. If I facet by blank twice on a column where there can be both blank and nonblank cells in a single record, and I select true in one facet and false on the other, I will still see some records while in records mode, even though no such rows exist in rows mode.

It also seems to me that it would be easier to achieve the existing functionality under this proposed paradigm than it would be to get the proposed functionality under the existing paradigm. If facets were applied to records in the way that I am suggesting, it would still be possible to model the existing behavior in most instances by simply replacing "value" in the expressions of each relevant facet with "row.record.cells.{{COLUMN}}.value" (note that this is already the only way to achieve this behavior in rows mode, so it doesn't seem like a huge change to have records mode behave the same way). On the other hand, in order to get the functionality I am suggesting under the current paradigm, a user must combine all of the facets they were using in rows mode into a single unwieldy and() function in the expression of one facet. If the existing workflow was deemed to be common enough that it warranted it, I could even imagine an interface where each facet has a "rows/records" toggle and selecting "records" would automatically replace "value" with "row.record.cells.{{FACETCOLUMN}}.value" behind the scenes.

Examples

I have created a sample dataset to hopefully illustrate some of the issues I currently encounter as a result of the existing facet logic in records mode.

Example 1.

As I mentioned before, having facets applied to records in a row-agnostic fashion leads to instances where a record is selected by a facet that would not have selected any of the rows in that facet in row mode. This means that I often find myself identifying a number of rows that fit the criteria I am looking for, wanting to look at the records that these rows are part of to get more context, and then changing to records mode only to see vastly more records than the ones that my original set of rows were part of. Here is an example using my sample dataset:

In this project, each record represents a species of shark, and each row represents a survey of that species. In rows mode, I have narrowed it down to only show me surveys that were done at the Gulf of California on species from the super-orders Batoidea and Galeomorphii. However, if I want to look at other surveys done on the same species, there is not an easy way to do this. When we go to records mode, there are 25 matching species, even though we only had four matching species before. This is because we are now seeing records that contain surveys from the Gulf of California but that don’t correspond to the super-orders Batoidea and Galeomorphii, and surveys from the super-orders Batoidea and Galeomorphii but not the Gulf of California. (This is actually just because the super-orders are not filled down, so some of the rows that should have the same super-orders are blank, but one can easily imagine how this error could arise in exactly the same way if the super-order column had multiple possible values for each species).

Example 2.

Another issue mentioned above is that, because the facets in records mode look at all the rows in a record to make a judgement, it is possible to have two mutually exclusive facets still come up with results.

Here you can see that I have faceted the Confidence column by blank and selected true and false. If I were in rows mode, this would have yielded no results, since a cell can’t be both blank and non-blank, but here we have results since the Confidence column has both blank and non-blank cells in this record. I feel that the inconsistency between behaviors in rows and records mode in this sort of instance is confusing.

There's an associated issue in the issue tracker #7710 but I suggested Ella kick off a discussion here to invite wider involvement.

The first thing to say is that it's a long standing problem that record mode, while useful, is far from perfect (or complete). I think part of the problem here is that each of these "records" is actually a group of sub-records and OpenRefine doesn't have a good way to represent this. If you import an XML or JSON file, it will kind of pretend that it does by showing colored header bars over what it calls "column groups" internally, but these don't round-trip, are easily lost, and aren't really used for anything. Instead the current model is much more a "bag of values" model where each cell in a record column is considered an unordered value.

If we had a table of books (works) with title, authors, subjects, editions, we know intuitively that the authors column is going to contain all the co-authors and the subjects column is going to contain a list of subjects, but that the two columns are independent of each other and are just associated with the work. On the other hand, we have a few more columns with ISBN, price, format, publication place, publication date, then each of these attributes of an edition go together and it wouldn't make sense to mix and match them. If we're looking for an edition that was published in London before 1950, we can't pick the publication location from one edition and the publication date from another.

If we were to somehow introduce an option to require facets to all apply to the same row of a record, we'd need to figure out what to do about blank values. One possibility might be to treat them as if a virtual Fill Down operation had been done on the column, but it's not clear to me that that would always be the correct solution.

I think today our best solution would probably involve doing a Fill Down on the leading columns and then do the faceting in row mode. If you want to make a sample of the data available for people to play with, they may be able to come up with other ideas.

Is this the original format of the data? If not, what was the original format? It looks like the kind of data that might come from GBIF, but that's not something I use, so I'm not sure what format options they have.

Best,
Tom

Thanks so much for that response. This data is actually originally from a csv I downloaded from sharkipedia.org, but I ended up doing a lot of transformation on it to illustrate my point better. I am happy to share the project file here or find another dataset to look at if that would be helpful.