Hints, Tips and Tricks

I thought it might be fun and interesting to post some OpenRefine hint/tip/tricks. I’m going to try to post one every week on a Friday - at least for as long as I can keep it going… so here goes with my first one:

The GREL length() function can be used to check the length of an array. It’s surprisingly useful in a wide variety of scenarios. For example:
value.find("string").length()
or
value.find(/regular expression/).length()
can be used in a custom text facet to see how often a particular string or regular expression appears in each cell.

One of the scenarios I use it often is to check for matches in another project using:
cell.cross("project name","column name").length()

Because the cross() function returns an array of rows from the named project which have a match, the length() of the array is the number of matches - you can use this to find situations where there are either more or fewer matches than expected.

If you have your own hints/tips/tricks that you use and feel others would find useful - please feel free to share them in this topic!

3 Likes

Nice post @ostephens and hello! I recently used VLOOKUP in Google Sheets to compare values in one sheet’s column with values in another sheet’s column and indicate if there is a match. It sounds like a combination of what you provided will work in the same fashion? I found this post from @Martin I believe vlookup in google refine ~ RefinePro Knowledge Base for OpenRefine. I did this on DOIs pulling from two different databases. Also curious if you can merge the two export files based on this condition, so you have a CSV export from database 1 that has a column with DOIs and you have the same from database 2. I did this work outside of OpenRefine in Google Sheets but it would be nice to know if you can do it within OpenRefine.

2 Likes

Yes cross is very similar to. VLOOKUP. You can also use 'cross' within a single project - you just need to put in the current project name in the cross function arguments.

Yes - at least to a degree.
The 'cross' function returns an array of rows from the named project. It's an array because there could be more than one match. You can extract any cell from the rows returned. If we imagine that you had a table like:

Row No DOI Title Author
1 10.1234/5 A guide to OpenRefine Stephens, O
2 10.1234/6 A guide to DOIs Stephens, O

If you were to do a cross() based on the DOI 10.1234/5, you'd get back an array like:
[ row1 ]
but if you were to do a cross() based on the Author name "Stephens, O" you'd get an array like:
[ row1 , row2 ]

So for example we can do:
cell.cross("DOI project", "DOI")[0] <---- this the first matching row from the project

Since a row is a collection of cells in columns with values, we can then extract the value from a specific cell in the row:
cell.cross("DOI project", "DOI")[0].cells["Title"].value <----- this is the Title from the first matching row from the project

or

forEach(cell.cross("DOI project", "Author"),r,r.cells["DOI"].value) <---- the forEach iterates across all matching rows and gets the DOI in each case - the result is an array of DOIs

This gives a way of getting back specific data from a row:
row.cells["column name"].value
But if you want all the values from a row there's another trick which is to make use of the fact that a row contains not just all the cells in the row, but also the column names which you can then iterate through like:
forEach(row.columnNames, cn, row.cells[cn].value) <---- an array containing all the values from the row

So in the above example you can do something like:
with(cell.cross("DOI project", "DOI")[0],r,forEach(r.columnNames,cn,r.cells[cn].value)) <---- array of all the values from the first matching row

This returns an array - so you'd have to join together to get a value to store in a cell. But once that is done, you can get a new cell containing the full data from the row from the other project - e.g.
with(cell.cross("DOI project", "DOI")[0],r,forEach(r.columnNames,cn,r.cells[cn].value)).join("|||")

This can be then split using the Edit Columns -> Split into several columns ...
And this way you can get all of the data in a single project!

The "cross" function, the use of arrays and the manipulation of rows are definitely in the set of less intuitive areas of OpenRefine but once you get familiar with them they can be pretty powerful.

1 Like

It’s Friday, so time for another hint/tip/trick from me.
Today I’m going for a very straightforward piece of functionality but one that I use regularly: Using Flags and Stars to keep track of particular rows.

You can “Star” or “Flag” rows in OpenRefine by clicking the star/flag icon in the “All” column. The star and flag don’t have any special meaning - they are just two ways of marking particular rows. Once you have some starred/flagged rows you can use “Facet by star” and “Facet by flag” (in the All column drop down menu) to find all the rows you’ve starred/flagged (or those you haven’t). I often use the ability to flag/star rows manually to handpick a set of rows I want to work with (for example get rid of some extraneous rows like header information I didn’t mean to import in the first place)

But typically I use the “Flag rows” and “Star rows” tools (in the All → Edit rows menu) to mark larger groups of rows I want to work on. I find this especially useful if I’ve identified some patterns of data that need fixing but the fix takes a few steps, and one of those steps breaks the filter I’ve setup. Instead of relying on filtering by a pattern in the data while the data changes (as I fix the problem), I can use the pattern to isolate the rows (e.g. with facets or text filters), use the “All” dropdown menu “Flag rows” to add a flag to all these rows, and then use the Facet by flag option in place of my original filters.

Using flags and stars together can also be a nice way of tracking rows that fulfil two different sets of complex criteria - rather than having to maintain a complex set of potentially overlapping filters/facets you can set up your first set of critera, and star all the matching rows, then clear everything and apply your second set of criteria and flag all the the matching rows. All the rows that matched both sets of criteria are now both flagged and starred and can be found using the flag/star facets together.

Do you use Flags and Stars? If so any particular tasks or ways in which you find them useful?

2 Likes

I missed posting in this thread for the last two weeks (and unlikely to post next week as I shall be enjoying a holiday) but I’ve just seen a hint from @Antoine2711 on Github that is too good not to post here.

Did you know you can get data from a different row in your project by using the following GREL:
cross(<row index>).cells["<column name>"].value

So for example if we have a project where the values in one column have become offset by 1

City Country
London
Paris United Kingdom
Amsterdam France
Berlin The Netherlands
Germany

You can fix it using the following transformation on the Country column:
cross(rowIndex+1).cells.Country.value[0]
(the [0] is required because cross returns an array but in this case it can only have a single value)

City Country
London United Kingdom
Paris France
Amsterdam The Netherlands
Berlin Germany

There are other uses for this as well, but this is a brilliant trick - thanks @Antoine2711

3 Likes

I was just asking about this kind of strategies on Best strategy on wikidata upload. I ended up using flags and star for that. It’s a bit tricky at first, but with some practice and by working step by step, it is very useful. Thanks for this.

1 Like