Playing a bit with ChatGPT to make my life easier - I think it can help (?) users generate regex, GREL...?

Posting this here because I think this is an area where AI (around which I’m very cautious) can potentially be of great help to OpenRefine users.

One of the harder things for me (and for the people I teach OpenRefine to) is finding the right scripts and expressions for what I need to do. Yes, you can search the web, and the answer may be in the documentation. But looking things up is time-consuming, it is frustrating when your exact case is not described anywhere, the documentation can be hard to understand if you are not a coder, and the search pulls you out of the flow of the task at hand.

I haven’t mastered regex (frankly, I kind of stubbornly refuse to learn it, because most people I teach or help don’t know it either and their priorities are elsewhere). I’ve occasionally used https://www.autoregex.xyz/, which transforms natural language into regex, but the service is spotty.

I have just tried ChatGPT to do the same things for me. I don’t have a concrete project at the moment with which I can test if it works, but I tried the following and I’ll leave it up to the experts to check if the output is correct :disguised_face:

Me: Give me a correct regular expression for “the last comma in this series of characters”

ChatGPT: This regular expression matches the last comma in a string of characters:

/,(?!.*,)/

The (?!.*,) part is a negative lookahead that asserts that there is no other comma ahead in the string.
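For anyone who wants to check this pattern outside OpenRefine, here is a minimal Python sketch. The helper names (`last_comma_index`, `replace_last_comma`) and the sample strings are made up for illustration, and it assumes single-line values: `.` does not match a newline by default, so on multi-line text the pattern would find the last comma of each line instead.

```python
import re

# Pattern from the post: a comma that has no other comma after it.
LAST_COMMA = re.compile(r",(?!.*,)")

def last_comma_index(text):
    """Return the position of the last comma, or None if there is none."""
    match = LAST_COMMA.search(text)
    return match.start() if match else None

def replace_last_comma(text, replacement):
    """Replace only the last comma in a single-line string."""
    return LAST_COMMA.sub(replacement, text, count=1)

print(last_comma_index("Smith, John, Jr."))
print(replace_last_comma("red, green, blue", " and"))
```

On these samples the lookahead rejects every comma except the final one, which suggests the ChatGPT answer is correct for the single-line case.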

Me: Give me a correct regular expression to retrieve a series of four numbers from a string of characters. The numbers are between rounded brackets, starting with the number 1

ChatGPT: This regular expression matches a series of four numbers between rounded brackets, starting with the number 1:

/\(1[0-9]{3}\)/
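A quick way to sanity-check this one is sketched below in Python. The function name and the sample text are hypothetical. One deliberate change from ChatGPT's answer: a capture group is added around the digits so that the result contains only the number, without the brackets. The pattern as given matches the brackets too.

```python
import re

# Same pattern as above, with a capture group around the four digits
# so findall() returns the numbers without the surrounding brackets.
FOUR_DIGITS_IN_BRACKETS = re.compile(r"\((1[0-9]{3})\)")

def bracketed_numbers(text):
    """Return every four-digit number starting with 1 that sits between ( and )."""
    return FOUR_DIGITS_IN_BRACKETS.findall(text)

sample = "Night Watch (1642), restored (1975), reprint (2019), room (12)"
print(bracketed_numbers(sample))
```

Numbers that start with another digit, or that do not have exactly four digits between the brackets, are skipped.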

And here comes the banger… :exploding_head:

Me: Create a piece of GREL code for OpenRefine in which I shorten the values in a column by removing the first 7 characters in each cell.

ChatGPT: Here is the GREL (Google Refine Expression Language) code to remove the first 7 characters from each cell in a column in OpenRefine:

value.substring(7)

You can use this expression in OpenRefine by selecting the column you want to modify, clicking on the “Edit cells” button, and selecting “Transform” from the drop-down menu. Then, you can paste the expression into the “Expression” field and click on “OK”. The changes will be applied to all cells in the column, removing the first 7 characters from each.
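For comparison, the same operation in Python is a plain slice. This is only a sketch: `drop_prefix` and the sample value are made up for illustration. OpenRefine's expression editor also accepts Python/Jython, where the equivalent expression would be `return value[7:]`.

```python
def drop_prefix(value, n=7):
    """Return the string without its first n characters."""
    return value[n:]

print(drop_prefix("ID2023_Amsterdam"))
```

In Python, a value shorter than seven characters simply becomes an empty string. How GREL's `substring` treats such short cells is worth checking on a copy of the column first.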

I’ll try this kind of thing ‘for real’ in my next project. For now, just putting this here :slight_smile:

This is very interesting. I was just talking with someone who came to a presentation I gave on OpenRefine last Saturday. His main concern was having to deal with regular expressions. I’ll consider ChatGPT next time I have a question about regex.

Interesting and useful experiments. Could you all post similar experiments with ChatGPT or the OpenAI APIs?

We have recently created scope notes for undefined terms of the Homosaurus vocabulary by using OpenAI prompts.

Interesting use case! See “Using the OpenAI API to apply natural language queries to cells/data” for the original post on the code.

Maybe train an AI on semiring algebra so it can just do the right thing. Present it with a corpus of loosely structured input, highlight features of interest, and have it construct a transducer (automaton) to locate and extract those features to an output sink. Gold [1963] said this is impossible, but maybe we can get close enough. After all, Gödel said arithmetic is impossible.

From https://en.cppreference.com/w/cpp/regex: [image not preserved]

Folks avoid the conventional regex constructs mainly, I think, because of the way they are presented. Most of the available tools (grep, awk, sed, perl, etc.) are essentially derivatives of the original work done by Thompson et al. back in the 1970s, and they haven’t evolved much since then. They offer cramped and limited means for expressing patterns and minimal support for interacting with identifiable features.

Semiring algebra is clean, stable, and robust.