Is there a GREL function to alphabetically compare strings?

jquartel · January 30, 2024, 12:18pm

I need to make a facet that finds all values that come after a certain word when ordered alphabetically. I thought the relational operators (e.g. > ) might do the trick but it looks like they don't work for strings. I scoured the documentation for a relevant function but haven't found anything. Am I just missing it, or is there no way to do this? The only workaround I can think of is to create a column and fill it, for every entry, with a fixed value ("true", say). Then order the whole table according to the column I want to filter and reorder rows permanently. Then jump through the whole table to find the first entry that is beyond my cutoff value and then 'blank down' in the new column from that entry. Then I can facet on the values in the new column, but this seems a but of a fuss.

thadguidry · January 30, 2024, 12:39pm

Try this as a first step. Then I bet you can figure out the rest that's needed. Remember you can even use GREL for Custom Text Facets.

This falls into "String parsing and splitting". You want to split your Strings after a certain word or phrase.

Try the partition() or rpartition() which outputs an Array. Ensure you select the 2nd part of the Array, where GREL indexing starts from 0 zero.

value.partition('A Certain Word or Phrase')[2]

"Some word to partition".partition('word')[0]
"Some "

"Some word to partition".partition('word')[2]
" to partition"

Docs Reference: GREL functions | OpenRefine

ostephens · January 30, 2024, 12:45pm

Just trying to understand the request.

Do you mean that if you have a column like:

Apple
Boy
Carrot
Donkey

You want to be able to do a facet like "show me all values in the column that come after "Burst" alphabetically" - i.e. containing "Carrot" and "Donkey" (where "Burst" is a word of your choice)

jquartel · January 30, 2024, 1:27pm

Exactly that. In my particular case, the strings are names of tape assets so look like "HNDAAAL7" with the three letters before the L7 are incremented when another tape is created (so the next one is HNDAABL7 and the 27th one is HNDABAL7). With a normal text facet I can get a nice alphabetically ordered list in the facet box but I don't want to click on each one in order to select all the ones I want to isolate (there are about 50 out of 200).

jquartel · January 30, 2024, 1:32pm

Thanks thadguiry but I'm not looking for a positional split within a string. ostephens's example is what I'm after. With a column that has numerical values I can do something like value.toNumber() < threshold, but there seems to be no string equivalent.

ostephens · January 30, 2024, 1:47pm

It's not very extensible but for your very specific case as you describe it, you could do something like:

with(value+" "+"HNDAB",v,v.split(" ").sort().join(" ")==v)

This makes quite a few assumptions, but given your description I think they are safe in this particular case - essentially that you are working with strings with no spaces in them and all uppercase. The HNDAB string is the value you are comparing against and the outcome is true if the value comes before the test string (based on. the array sort function, which is character value based rather than strictly alphabetical but if we are dealing with uppercase ASCII range we're OK) and false if it comes after

jquartel · January 30, 2024, 2:36pm

Ah I see, yes this does work, though I don't think we need the split/join construct. This expression :
[value, "HNDABEL7"].sort()[0] == value
evaluates to true for all values 'earlier' than or equal to "HNDABEL7". Maybe not as intuitive or elegant as a straight comparator (consider adding such a function for strings?), but just as effective. Thank you!

jquartel · January 31, 2024, 9:44am

I think the documentation could do with a small section on operators. The arithmetic ones are kind of obvious but it's good to highlight the behavior of the / operator for ints, and the % (modulus) operator is worth a mention too. I think it's worth highlighting that some (==, !=, +) work for strings as well. Perhaps some work for dates too?

ostephens · January 31, 2024, 10:25am

Good catch - had overlooked you can just directly create the array

ostephens · January 31, 2024, 10:26am

@jquartel would you be willing to create a github issue for this at Issues · OpenRefine/openrefine.org · GitHub ?

jquartel · January 31, 2024, 11:25am

Sure, which one? The lack of documentation on operators or adding a string comparator function, or both?

ostephens · January 31, 2024, 11:39am

Either

But for documentation we need an issue at Issues · OpenRefine/openrefine.org · GitHub (which is what I'd really meant by my original message)

And for a new string comparator function we'd need a feature request at Sign in to GitHub · GitHub

tfmorris · February 2, 2024, 6:05pm

Sure, which one? The lack of documentation on operators or adding a string comparator function, or both?

The binary comparison operators should totally support strings (and dates). It's just an oversight that they don't.
I created issue #6340 to remedy the oversight and put up an implementation in #6342 (which also fixes #6341).

For strings, the comparison is done using the current locale's collating sequence with normalized decomposition so that both composed and non-composed versions of É are equal to each other.

Date comparison is actually done using Java's Comparable interface, so any future OpenRefine data types which implement this interface will get support for free.

Tom

thadguidry · February 3, 2024, 2:31am

How does a user control if they truly do want String literal equality checking? My comment is here:

github.com/OpenRefine/OpenRefine

Improve GREL operators. Fixes #6341 Closes #6340

OpenRefine:master ← tfmorris:6340-grel-operators

opened 05:58PM - 02 Feb 24 UTC

tfmorris

+108 -15

Fixes #6341. Closes #6340 - restrict string concatenation (+) to cases where …at least one operand is a string (to avoid things like concatenating two dates as strings) - Add support for dates and strings to comparison operators Strings use a collator for the default locale with normalized decomposition for the comparisons. - Above implementation uses Comparable, so any future data types that implement that interface should get supported for free - Add a bunch more tests

jquartel · February 3, 2024, 4:23pm

I've submitted a ticket for the documentation (#289), and it looks like @tfmorris has picked up the baton for the overloading of the operators (thank you!). I actually had a different idea for selecting consecutive runs of facet items, so I've added a feature request for that (#6337). You are all so astute and responsive - I'm forever in awe at how well this project is run and supported. Thank you again!

Topic		Replies	Views
GREL for exact word matching between columns Data cleaning and transformations	1	286	November 8, 2023
Reordering text within a cell Data cleaning and transformations	2	204	April 1, 2024
Text Facet (not alphabetical order) Support and Helpdesk	7	266	February 2, 2024
GREL function previews ok, then fails Data cleaning and transformations	4	379	April 27, 2023
Possible error in GREL documentation for cell reference Support and Helpdesk	6	129	June 5, 2024

Is there a GREL function to alphabetically compare strings?

Related topics