Is there a GREL function to alphabetically compare strings?

I need to make a facet that finds all values that come after a certain word when ordered alphabetically. I thought the relational operators (e.g. > ) might do the trick but it looks like they don't work for strings. I scoured the documentation for a relevant function but haven't found anything. Am I just missing it, or is there no way to do this? The only workaround I can think of is to create a column and fill it, for every entry, with a fixed value ("true", say). Then order the whole table according to the column I want to filter and reorder rows permanently. Then jump through the whole table to find the first entry that is beyond my cutoff value and then 'blank down' in the new column from that entry. Then I can facet on the values in the new column, but this seems a but of a fuss.

Try this as a first step. Then I bet you can figure out the rest that's needed. Remember you can even use GREL for Custom Text Facets.

This falls into "String parsing and splitting". You want to split your Strings after a certain word or phrase.

Try the partition() or rpartition() which outputs an Array. Ensure you select the 2nd part of the Array, where GREL indexing starts from 0 zero.

value.partition('A Certain Word or Phrase')[2]

"Some word to partition".partition('word')[0]
"Some "

"Some word to partition".partition('word')[2]
" to partition"

Docs Reference: GREL functions | OpenRefine

Just trying to understand the request.

Do you mean that if you have a column like:

Apple
Boy
Carrot
Donkey

You want to be able to do a facet like "show me all values in the column that come after "Burst" alphabetically" - i.e. containing "Carrot" and "Donkey" (where "Burst" is a word of your choice)

Exactly that. In my particular case, the strings are names of tape assets so look like "HNDAAAL7" with the three letters before the L7 are incremented when another tape is created (so the next one is HNDAABL7 and the 27th one is HNDABAL7). With a normal text facet I can get a nice alphabetically ordered list in the facet box but I don't want to click on each one in order to select all the ones I want to isolate (there are about 50 out of 200).

Thanks thadguiry but I'm not looking for a positional split within a string. ostephens's example is what I'm after. With a column that has numerical values I can do something like value.toNumber() < threshold, but there seems to be no string equivalent.

It's not very extensible but for your very specific case as you describe it, you could do something like:

with(value+" "+"HNDAB",v,v.split(" ").sort().join(" ")==v)

This makes quite a few assumptions, but given your description I think they are safe in this particular case - essentially that you are working with strings with no spaces in them and all uppercase. The HNDAB string is the value you are comparing against and the outcome is true if the value comes before the test string (based on. the array sort function, which is character value based rather than strictly alphabetical but if we are dealing with uppercase ASCII range we're OK) and false if it comes after

Ah I see, yes this does work, though I don't think we need the split/join construct. This expression :
[value, "HNDABEL7"].sort()[0] == value
evaluates to true for all values 'earlier' than or equal to "HNDABEL7". Maybe not as intuitive or elegant as a straight comparator (consider adding such a function for strings?), but just as effective. Thank you!

I think the documentation could do with a small section on operators. The arithmetic ones are kind of obvious but it's good to highlight the behavior of the / operator for ints, and the % (modulus) operator is worth a mention too. I think it's worth highlighting that some (==, !=, +) work for strings as well. Perhaps some work for dates too?

1 Like

Good catch - had overlooked you can just directly create the array

@jquartel would you be willing to create a github issue for this at Issues · OpenRefine/openrefine.org · GitHub ?

Sure, which one? The lack of documentation on operators or adding a string comparator function, or both?

Either :slight_smile:

But for documentation we need an issue at Issues · OpenRefine/openrefine.org · GitHub (which is what I'd really meant by my original message)

And for a new string comparator function we'd need a feature request at Sign in to GitHub · GitHub

1 Like

Sure, which one? The lack of documentation on operators or adding a string comparator function, or both?

The binary comparison operators should totally support strings (and dates). It's just an oversight that they don't.
I created issue #6340 to remedy the oversight and put up an implementation in #6342 (which also fixes #6341).

For strings, the comparison is done using the current locale's collating sequence with normalized decomposition so that both composed and non-composed versions of É are equal to each other.

Date comparison is actually done using Java's Comparable interface, so any future OpenRefine data types which implement this interface will get support for free.

Tom

1 Like

How does a user control if they truly do want String literal equality checking? My comment is here:

I've submitted a ticket for the documentation (#289), and it looks like @tfmorris has picked up the baton for the overloading of the operators (thank you!). I actually had a different idea for selecting consecutive runs of facet items, so I've added a feature request for that (#6337). You are all so astute and responsive - I'm forever in awe at how well this project is run and supported. Thank you again!

1 Like