Is there a way to extract the difference between two cells with space separated data. for example, col_a contains 101 102 103 104 401 and col_b contains 101 102 104. i need col_c to output 103 401
Thanks so much in advance!
Is there a way to extract the difference between two cells with space separated data. for example, col_a contains 101 102 103 104 401 and col_b contains 101 102 104. i need col_c to output 103 401
Thanks so much in advance!
I would do that in python (jython).
Use the « transform… » operation thru the « Edit column » submenu of the second column menu:
in transform field, you can refer to the cells of the selected column with value
, and the value of the other column with this syntax: cells["Other column name"].value
(at least, that’s the grel syntax, not sure of the python/jython syntax).
Regards, Antoine
thanks for your reply @Antoine2711 . I know the syntax to refer to values of a cell though. It's comparing the velues between col_a and col_b and getting the unmatched values to display in col_c I'm having trouble with But thanks for responding to my question.
Can you code in python?
I asked ChatPGT : « Write me code in python to transform these two strings: « 101 102 103 104 401 » and « 101 102 104 » into arrays, and extracts the numbers that are unique in both arrays. »
He kindly answered this:
# Original strings
str1 = "101 102 103 104 401"
str2 = "101 102 104"
# Convert strings to arrays of integers
arr1 = list(map(int, str1.split()))
arr2 = list(map(int, str2.split()))
# Find unique numbers in both arrays
unique_numbers = list(set(arr1).symmetric_difference(set(arr2)))
print("Unique numbers:", unique_numbers)
He also said other things, but to cut short, I didn’t include it here.
I would suggest you to try.
Let the machine work for you! And then, adjust.
Regards, Antoine
Hi @vdmarc and welcome to OpenRefine!
As @Antoine2711 said, we don't have a diff function built-in to GREL.
@ostephens is familar with our new-ish GREL inArray()
function which might be useful here, I think. I cannot find the Recipe for the nice forEach() loops he once did as an example for this kind of usecase, and I tried quickly but failed, but hopefully he responds to this request.
I do sort of what you're asking for a lot with sentence structures, but it's a strict diff, in that it returns the substring after where the 2 strings differ, as you'll see, without concerning itself with separators. But still likely useful to document for you and perhaps others, just in case.
You can access many of the built-in Java libraries we already provide with OpenRefine, such as Apache Commons Lang, which does happen to have a StringUtils/difference function. To use it in OpenRefine, it looks like this for your usecase which you can copy/paste into the expression editor:
(import [org.apache.commons.lang StringUtils])
#_ takes 2 strings as input ex. "StringUtils/difference(String1 String2)", and returns the portion after where they first differ.
(StringUtils/difference
#_ 1st string, our original value in the column we're running this recipe on (your col_b value - in this case it's an easy expression, just "value")
value
#_ 2nd string, this one is more complicated because we don't have good properties exposed for Clojure, but anyways. So this is a long expression that just gets the value from a different cell in the row by a column name, in your case, col_a's value
(.value (.cell (.getField cells "col_a" (new java.util.Properties))))
)
If you want to geek out, here's some recipes:
and our docs:
Jython & Clojure | OpenRefine
OK... spent more time and another alternative.
Use this snippet with the "Add column based on..." onto your col_b
and it will do diff the values to col_a
and output a string that has element differences only.
(require '[clojure.set :as set]
'[clojure.string :as string])
(let [diffset (set/difference
(set (string/split (.-value (.-cell (.getField cells "col_a" (new java.util.Properties)))) #" "))
(set (string/split value #" ")))]
(string/join " " diffset))
if you change your lookup column to be different than "col_a", then just change that part in the Clojure recipe on line 5.
This returns a string of only the elements that are different, as you requested!
(Jeesh, we need macros in OpenRefine - like yesterday!)
thank you, @Antoine2711! I will give this a try
Thanks so much @thadguidry, will try this as well
Hi, @thadguidry!
Great! Glad that you were able to make it work. Too bad mine is giving an error...
I might have copied incorrectly.. am still checking. will let you know if I'm able to fix it Thanks again!
I updated the code snippet. (Although it was correct, sometimes OpenRefine Expression editor misses a few things in the syntax, and we still don't know why this happens. For Clojure and Python, it usually just means fixing indents and ensuring they are done only with spaces and not tabs. But Clojure doesn't use indents or looks at them like Python, it uses blocks of paratheses since it's based on Lisp syntax)
With Clojure and Python, I usually use my IDE or a smart online editor for those languages, and then paste the code into OpenRefine.
Anyways... Copy/Paste the above updated snippet again.
Thank you, @thadguidry ! I was finally able to make it work!
I had to manually type the code and use 8 spaces for the indents and 1 space indent for the last line
Thanks so much!
You're Welcome!
Eventually we'd like to fix that issue. Maybe I'll work on that next month.
Is there a way to extract the difference between two cells with space separated data. for example, col_a contains 101 102 103 104 401 and col_b contains 101 102 104. i need col_c to output 103 401
You can do this directly in GREL without resorting to other expression languages:
filter("101 102 103 104 401".split(" "),i,not(inArray("101 102 104".split(' '), i)))
or, assuming you are doing Add column based on this column using col_a
filter(value.split(" "),i,not(inArray(cells["col_b"].value.split(' '), i)))
Tom
Thank you @tfmorris , this worked too! I only had to add .join(" ") to the expression to output the results:
filter(value.split(" "),i,not(inArray(cells["col_b"].value.split(' '), i))).join(" ")
Thank you, everyone, for all your help!