How much data can OpenRefine handle?

This question comes up regularly, so it may be helpful to take a look at this discussion thread from the old OpenRefine Google Group: https://groups.google.com/g/openrefine/c/-loChQe4CNg/m/eroRAq9_BwAJ

If your question isn’t answered by that thread, please feel free to ask for more information by replying to this topic!

Newbie question here: I'm applying GREL functions I've copied from members of this group and I keep getting the message about too many choices to display. What are my options for changing all the rows in a given column, say, from all caps to title case?
Thanks.

Hi @Nancy_Sack

The "too many choices" message is just a warning - you can increase the limit (I think the warning should prompt you to do this; let me know if it doesn't). However, if you make the limit very large you could see some slow performance. For reference, I currently have my maximum facet display set to 23,000. I wouldn't necessarily recommend going that high, but a few thousand should not be a problem.

However, changing all the rows in a given column from all caps to title case is relatively straightforward and doesn't require a facet. To apply a change to a column, open the dropdown menu at the top of the column and choose Edit cells. In the sub-menu that displays you'll see an option called Common transformations, and under that To titlecase. You can simply select that option to apply the change.

I think you might find a tutorial like the Library Carpentry OpenRefine lesson a useful way to get started - you can work through this by yourself: Library Carpentry: OpenRefine: Summary and Setup

The lesson on making "transformations", including the example of changing text to title case, is in the 7th episode of the lesson: Library Carpentry: OpenRefine: Introduction to Transformations

Hope this is helpful

Owen

PS I'm still writing an answer to your other question about subject headings but have been travelling and preparing for a presentation - I'm hoping I might be able to send you an answer this evening

Just checking that warning you mentioned - if you get a warning like this:

Click the "Set choice count limit" option and you'll see something like:

If the number in the box is large enough (it will always give a number bigger than the list you are currently trying to display by hopping up to the next round thousand), you can click "OK" and the limit will be increased and your facet list should display automatically

Many thanks, Owen, for your informative message. I did go through the excellent Library Carpentry tutorial. The reason I asked about changing all caps titles to title case is because I want to preserve the titles in "cataloger case" (with first word capitalized along with all proper nouns and adjectives).

Thanks in advance for your advice on splitting subject headings. I'm hoping to reconcile the resulting terms to FAST, if that's possible.

Best,
Nancy

Owen, I used your expression:
forEach(value.split(' '),v,if(isNonBlank(v.match(/([^a-z])/)[0]),toTitlecase(v.match(/([^a-z])/)[0]),v)).join(" ")
And then clicked OK but nothing changed.

Hi @Nancy_Sack

I can't remember where I first posted this and what it was intended to do, but as it is written here it looks like this expression will look for any words that are written in ALL uppercase letters, and convert only those words to title case (i.e. so they just start with a capital letter). Words written in any other way will be left alone

forEach(value.split(' '),v,if(isNonBlank(v.match(/([^a-z]*)/)[0]),toTitlecase(v.match(/([^a-z]*)/)[0]),v)).join(" ")

So it would convert:
A CASE OF FAHR'S SYNDROME -> A Case Of Fahr's Syndrome
but convert
A case of Fahr's syndrome -> A case of Fahr's syndrome (i.e. do nothing)
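To make the logic of the expression easier to follow, here is a rough Python sketch of the same idea (OpenRefine's expression editor also accepts Python / Jython, where the body would end by returning the new cell value). The function name titlecase_uppercase_words is made up for illustration, and the capitalisation step only approximates GREL's toTitlecase, so treat it as a sketch rather than an exact equivalent.

```python
import re


def titlecase_uppercase_words(value):
    """Title-case only the words that contain no lowercase a-z letters."""
    result = []
    for word in value.split(" "):
        # A non-blank word with no lowercase letter is treated as ALL CAPS
        if word.strip() and not re.search(r"[a-z]", word):
            # Approximation of toTitlecase: first character up, rest down
            result.append(word[:1].upper() + word[1:].lower())
        else:
            # Anything containing a lowercase letter is left alone
            result.append(word)
    return " ".join(result)


print(titlecase_uppercase_words("A CASE OF FAHR'S SYNDROME"))
print(titlecase_uppercase_words("A case of Fahr's syndrome"))
```

The first call should print the title-cased string and the second should print its input unchanged, matching the two examples above.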

I'm not sure if that's exactly what you are looking for from your description? The challenge of starting with a string like:

A CASE OF FAHR'S SYNDROME

Is that there is no way of knowing in advance there is a proper noun in this string (so we can tell looking at it that Fahr is a proper noun and so should start with a capital, but this is a harder task for a machine).

I think converting titles to "cataloguer case" more accurately is a hard task because you can't rely on simple rules to know whether to capitalise certain words - in the above example there's no easy way to know that Fahr should be capitalised while case and syndrome should be lower case (and in the case of syndrome I had to look it up to check whether it would be Fahr's Syndrome or Fahr's syndrome - so not an easy task for a human either!)
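One partial workaround, if you are able to maintain your own list of proper nouns, is to lowercase the whole title, capitalise the first letter, and then restore the words on your list. The Python sketch below illustrates this; the function name to_cataloger_case and the proper_nouns list are hypothetical, and it only handles words separated by single spaces with no attached punctuation, so it is a starting point rather than a complete solution.

```python
def to_cataloger_case(title, proper_nouns):
    """Lowercase a title, restore listed proper nouns, capitalise the first letter.

    proper_nouns is a hypothetical, hand-maintained list such as ["Fahr's"].
    """
    # Map the lowercase form of each proper noun to its preferred spelling
    lookup = dict((noun.lower(), noun) for noun in proper_nouns)
    # Lowercase everything, then swap known proper nouns back in
    words = [lookup.get(word, word) for word in title.lower().split(" ")]
    sentence = " ".join(words)
    # Capitalise only the first character of the whole title
    return sentence[:1].upper() + sentence[1:]


print(to_cataloger_case("A CASE OF FAHR'S SYNDROME", ["Fahr's"]))
```

Any proper noun missing from the list will end up in lower case, so the result still needs checking by eye.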

I suspect there are tools that can do this work and it's probably possible to integrate these with OpenRefine - but off the top of my head there's no easy way to do this. To give an example, @michael_markert posted this use of the OpenAI API in another thread: Using the OpenAI API to apply natural language queries to cells/data. That's not the specific answer in this case, but I suspect that it could be adapted to do the work for you here. It would require an OpenAI API account and possibly some payment for the service.

I want to preserve the titles in "cataloger case" (with first word capitalized along with all proper nouns and adjectives).

Unfortunately, OpenRefine's toTitlecase operation doesn't do this. There's an enhancement request to improve it:
https://github.com/OpenRefine/OpenRefine/issues/2483

Tom