Returning record IDs when performing cross lookups

I’ve ran into a surprising issue when using the cell.cross GREL function within the same project and trying to extract the record index of the returned rows. Instead of distinct indexes, I get the same value repeated. When trying to extract the row values, however, I do get distinct values as expected. Is this a bug or a feature?

Sorry for the convoluted explanation, I’m not sure how else to explain this than by going through my use case.

My data looks a bit like this (simplified version of my real data):

001 035$a 245$a 020$a
7612. 53773 (CaOOAMICUS)000034856822 Why I love my mummy / 0007270208
9917. 53806 (CaOOAMICUS)000034856822 Why I love my daddy / 0007877617
12128. 249134 (OCoLC)ocn695560625 Culinary travels. B0047VZTDG
14579. 288353 (OCoLC)ocn695560625 Culinary travels. B0047VZUF8
17241. 509726 0000042320 The Umbrella Academy : 9781593079789
0000042320
(OCoLC)242629582

The first column is the internal OpenRefine record index (next to the star and flag options). As you can see, I have some multi-row records due to multiple values in the 035$a column.

What I’m trying to do is to find duplicated records that have the same 035$a value across different records, but not records that have duplicate 035$a values in the same record. So in my example above, the two first pairs are duplicates, but not the last one, where a value is repeated in 035$a but it’s in the same record.

To do so, I’ve been adding a column based on the 035$a column by using GREL. I’m able to successfully identify duplicated values, but it’s when I try to compare the record index of the found duplicates (to eliminate duplicates within a record) that I run into my issue. A way to illustrate this is to try to use GREL on the 035$a column to return some values from the found duplicates in a new column.

If I use

with(cell.cross("myprojectname","035$a"),dup,if(dup.length()>1,dup.cells["001"].value,""))

I am returned an array with the values in my 001 column (which are unique identifiers), something like

row value with(cell.cross("…
7718. (CaOOAMICUS)000034856822 [ “53773”, “53806” ]
10065. (CaOOAMICUS)000034856822 [ “53773”, “53806” ]
12974. (OCoLC)ocn695560625 [ “249134”, “288353” ]
17144. (OCoLC)ocn695560625 [ “249134”, “288353” ]
21312. 0000042320 [ 509726, null ]
21313. 0000042320 [ 509726, null ]

… which is to be expected.

However, when trying to use the record index instead

with(cell.cross("myprojectname","035$a"),dup,if(dup.length()>1,dup.record.index,""))

I get the following

row value with(cell.cross("…
7718. (CaOOAMICUS)000034856822 [ 7611, 7611 ]
10065. (CaOOAMICUS)000034856822 [ 9916, 9916 ]
12974. (OCoLC)ocn695560625 [ 12127, 12127 ]
17144. (OCoLC)ocn695560625 [ 14578, 14578 ]
21312. 0000042320 [ 17240, 17240 ]
21313. 0000042320 [ 17240, 17240 ]

… which I can’t explain. Thanks to the first formula, I’m reasonably confident my dup variable contains indeed the multiple records in which the same value is found, and I’m able to extract correctly the content of the 001 column for those records. But why when I try to retrieve the record index instead do I get the same value twice, when I’m sure the dup variable should point to two different records?

Instead, I would have expected the following, which would have been consistent with the first method.

row value with(cell.cross("…
7718. (CaOOAMICUS)000034856822 [ 7611, 9916 ]
10065. (CaOOAMICUS)000034856822 [ 7611, 9916 ]
12974. (OCoLC)ocn695560625 [ 12127, 14578 ]
17144. (OCoLC)ocn695560625 [ 12127, 14578 ]
21312. 0000042320 [ 17240, null ]
21313. 0000042320 [ 17240, null ]

I’ve tried to search multiple places for an explanation of this, but haven’t found anything so far. Am I fundamentally misunderstanding how the record index functions, perhaps?

Many thanks, and apologies in advance if this is not the right place to be asking such questions…

Forgot to add that I’ve experienced this both on version 3.6.2 and on the recently released 3.7.0.

I’ve ran into a surprising issue when using the cell.cross GREL function within the same project and trying to extract the record index of the returned rows. Instead of distinct indexes, I get the same value repeated. When trying to extract the row values, however, I do get distinct values as expected. Is this a bug or a feature?

Feature, but a somewhat confusing one. You can find more here: https://openrefine.org/docs/manual/exploring#rows-vs-records

A record is made up of one or more rows. All rows in the same record have the same record ID. By toggling between Record & Row mode in the UI, you can see the two different views of the data. The confusing part is that OpenRefine is 95% row-based, so how records operate can be kind of obscure.

Out of curiosity, why wouldn't you eliminate the duplicate 35$a identifiers within the same record? It seems like it would make your matching task easier.

Tom

Actually, I think I replied too quickly and there may be some other stuff going on here, but it’s hard to tell without the data to duplicate it with. Do you have a minimal dataset which will reproduce the problem with the code that you’ve given?

Tom

Thanks @tfmorris for looking into this. Indeed, I would expect multiple rows in a single record to have the same record index, but in my example, I’m also getting the same record index for rows coming from different records. I’m beginning to wonder if the cell.cross function maybe does a temporary binding of the returned rows to the record it’s being invoked from, which may explain the results I’m seeing.

You should be able to replicate the steps above with this data subset:
openrefine-test-data.csv (3.5 KB)

Out of curiosity, why wouldn’t you eliminate the duplicate 35$a identifiers within the same record? It seems like it would make your matching task easier.

Indeed, but I need to identify duplicates without changing the data... and would also like to develop a function I can reuse without having to remove such duplicates first.

Thanks for the reproducer data. I was actually able to reproduce things with the data from your email which gives nice single digit row and record numbers. I also simplified your expression to just

cell.cross("","035$a").record.index

to make the debug easier.

Thanks @tfmorris for looking into this. Indeed, I would expect multiple rows in a single record to have the same record index, but in my example, I’m also getting the same record index for rows coming from different records. I’m beginning to wonder if the cell.cross function maybe does a temporary binding of the returned rows to the record it’s being invoked from, which may explain the results I’m seeing.

Yes, it's definitely a bug, and you're not far off from the root cause. It's actually the .record field lookup which is wrong for arrays rows. It's using a global context instead of the data specific to the row. I have a fix in hand which just needs a little more testing.

Tom

2 Likes

Thanks for the quick turnaround and for opening the issue! Happy to help with the testing if needed.