I’ve run into a surprising issue when using the cell.cross GREL function to cross a project with itself and trying to extract the record index of the returned rows. Instead of distinct indexes, I get the same value repeated. When I extract cell values from those rows, however, I get distinct values as expected. Is this a bug or a feature?
Sorry for the convoluted explanation; I’m not sure how to explain this other than by going through my use case.
My data looks a bit like this (simplified version of my real data):
| record | 001 | 035$a | 245$a | 020$a |
|---|---|---|---|---|
| 7612. | 53773 | (CaOOAMICUS)000034856822 | Why I love my mummy / | 0007270208 |
| 9917. | 53806 | (CaOOAMICUS)000034856822 | Why I love my daddy / | 0007877617 |
| 12128. | 249134 | (OCoLC)ocn695560625 | Culinary travels. | B0047VZTDG |
| 14579. | 288353 | (OCoLC)ocn695560625 | Culinary travels. | B0047VZUF8 |
| 17241. | 509726 | 0000042320 | The Umbrella Academy : | 9781593079789 |
| | | 0000042320 | | |
| | | (OCoLC)242629582 | | |
The first column is the internal OpenRefine record index (next to the star and flag options). As you can see, I have some multi-row records due to multiple values in the 035$a column.
What I’m trying to do is find records that share a 035$a value with a different record, while ignoring records that merely repeat a 035$a value internally. So in my example above, the first two pairs are duplicates, but the last record is not: its value is repeated in 035$a, but within the same record.
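To make the goal concrete, here is a minimal Python sketch of the logic I’m after, outside OpenRefine. It is only an illustration: the `rows` list is a hand-flattened copy of the sample above (one entry per row, labelled with the record number shown in the table), and the function name `cross_record_duplicates` is made up for this example.

```python
from collections import defaultdict

# Hypothetical flattened view of the sample data: one (record number, 035$a value)
# pair per row. Record numbers are the ones displayed in the table above.
rows = [
    (7612, "(CaOOAMICUS)000034856822"),
    (9917, "(CaOOAMICUS)000034856822"),
    (12128, "(OCoLC)ocn695560625"),
    (14579, "(OCoLC)ocn695560625"),
    (17241, "0000042320"),
    (17241, "0000042320"),        # same value repeated inside one record
    (17241, "(OCoLC)242629582"),
]


def cross_record_duplicates(rows):
    """Map each 035$a value found in more than one distinct record to those records."""
    records_by_value = defaultdict(set)
    for record_number, value in rows:
        # A set collapses repeats of the same value within a single record.
        records_by_value[value].add(record_number)
    return {
        value: sorted(records)
        for value, records in records_by_value.items()
        if len(records) > 1
    }


duplicates = cross_record_duplicates(rows)
for value, records in duplicates.items():
    print(value, records)
```

The point of the sketch is that the comparison has to be made on the record each matching row belongs to, not on the row itself, which is why I need a distinct record index for every row that the cross returns.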
To do so, I’ve been adding a column based on the 035$a column by using GREL. I’m able to successfully identify duplicated values, but it’s when I try to compare the record index of the found duplicates (to eliminate duplicates within a record) that I run into my issue. A way to illustrate this is to try to use GREL on the 035$a column to return some values from the found duplicates in a new column.
If I use
with(cell.cross("myprojectname","035$a"),dup,if(dup.length()>1,dup.cells["001"].value,""))
I get back an array of the values in my 001 column (which are unique identifiers), something like
row | value | with(cell.cross("… |
---|---|---|
7718. | (CaOOAMICUS)000034856822 | [ "53773", "53806" ] |
10065. | (CaOOAMICUS)000034856822 | [ "53773", "53806" ] |
12974. | (OCoLC)ocn695560625 | [ "249134", "288353" ] |
17144. | (OCoLC)ocn695560625 | [ "249134", "288353" ] |
21312. | 0000042320 | [ 509726, null ] |
21313. | 0000042320 | [ 509726, null ] |
… which is to be expected.
However, when trying to use the record index instead
with(cell.cross("myprojectname","035$a"),dup,if(dup.length()>1,dup.record.index,""))
I get the following
row | value | with(cell.cross("… |
---|---|---|
7718. | (CaOOAMICUS)000034856822 | [ 7611, 7611 ] |
10065. | (CaOOAMICUS)000034856822 | [ 9916, 9916 ] |
12974. | (OCoLC)ocn695560625 | [ 12127, 12127 ] |
17144. | (OCoLC)ocn695560625 | [ 14578, 14578 ] |
21312. | 0000042320 | [ 17240, 17240 ] |
21313. | 0000042320 | [ 17240, 17240 ] |
… which I can’t explain. Thanks to the first formula, I’m reasonably confident that my dup variable does contain the multiple rows in which the same value is found, and I’m able to correctly extract the content of the 001 column for them. So why, when I try to retrieve the record index instead, do I get the same value twice, when I’m sure the dup variable should point to two different records?
Instead, I would have expected the following, which would have been consistent with the first method.
row | value | with(cell.cross("… |
---|---|---|
7718. | (CaOOAMICUS)000034856822 | [ 7611, 9916 ] |
10065. | (CaOOAMICUS)000034856822 | [ 7611, 9916 ] |
12974. | (OCoLC)ocn695560625 | [ 12127, 14578 ] |
17144. | (OCoLC)ocn695560625 | [ 12127, 14578 ] |
21312. | 0000042320 | [ 17240, null ] |
21313. | 0000042320 | [ 17240, null ] |
I’ve searched in multiple places for an explanation of this, but haven’t found anything so far. Am I perhaps fundamentally misunderstanding how the record index works?
Many thanks, and apologies in advance if this is not the right place to be asking such questions…