Hello, I am a very new OpenRefine user, thanks for your patience.
I'm trying to clean multiple sets of data and to apply my saved (long and time-consuming) cleaning history to another dataset. This worked fine yesterday. But today, when I try to apply the saved history, nothing happens. Like, nothing at all. I saved a copy of the JSON file as a txt file and tried importing that, and OpenRefine flashed the "working" window for a split second and then didn't do anything. I tried different saved histories and those didn't work either.
What is going on?? I am about to cry. How do I make it work again?
It's also worth saying that in order for the history to apply, the project you are applying it to has to have identical column names to your original project - the transformations are applied to named columns, so if a column named in the transformation doesn't exist in your project, that transformation won't run.
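If you want to check which columns a saved history expects before applying it, you can peek at the extracted JSON outside OpenRefine. Here's a rough sketch in Python - it assumes the common case where an operation stores its target column under a "columnName" key (true for things like text transforms, but not for every operation type), and the filename is just a placeholder:

```python
import json

# Quick check of which columns a saved operation history expects.
# Assumes the common case where an operation stores its target column
# under a "columnName" key; some operation types (e.g. column renames)
# use other keys, so treat this as a rough sanity check, not a validator.
# "operation-history.json" is a placeholder filename.
with open("operation-history.json", encoding="utf-8") as f:
    operations = json.load(f)

referenced = sorted({op["columnName"] for op in operations if "columnName" in op})
print("Columns this history expects:", referenced)
```

Comparing that list against the columns in your new project is a quick way to spot operations that would have nothing to act on.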
The more information you can share about the data you are working with, the structure of the projects, the transformations you are applying, and which version of OpenRefine you are using, the more likely it is that someone will be able to figure out what's going on.
I wanted to reopen this discussion, as I have noticed that this difference in behavior appears to have begun with version 3.9. I have long used earlier versions at work, and applying a saved JSON operation history to a project, even if the project did not have every column named in the JSON, seems to work in OpenRefine 3.8.2. However, I just replaced the laptop I use when working from home and installed OpenRefine 3.9.5, and I am now experiencing this problem as well. This is a bit frustrating, as my team built workflows around reusing JSON histories in OpenRefine for somewhat messy datasets that we get from multiple providers, and these often have inconsistencies in the column names. So I may revert to using older versions of OpenRefine while we consider what we should do moving forward so that our workflows are not disrupted. It would be helpful to have a setting that allowed you to override this behavior; that is, when applying an operations history via JSON, if OpenRefine encounters an invalid instruction, it could just skip it and move on to the next one.
I noticed the same thing: one of my colleagues is using a previous version of OpenRefine because the column names sometimes change in the JSON data we're cleaning, and our operations stopped working in 3.9+ because of it. I would second an override feature for sure; we don't really have an option at this point other than to use an older version or redo the operations every time the JSON data changes.
Thanks for the report! Are you able to share an example recipe that works in 3.8 but fails in 3.9? There's been a lot of work done to make this part of the application more robust, but I wonder if in doing so there was a breaking change. Having a concrete example to highlight how and where this fails would be a huge help.
@Rory the biggest change I've noticed is that now (3.9.x) I can't apply a set of operations unless they are all valid. It used to be the case that invalid operations were just ignored and the rest of the set was applied, but now the presence of an invalid transform leads the whole set to fail. Specifically, the issue I've noticed is that you can no longer include transformations in the history that operate on columns that are not present in the project.
I understand why this might be a good thing, but it's also sometimes a pain. I have to be a lot more careful that I have everything just right before I apply a transformation.
To give a specific example, if I have a set of data transforms that are useful over a set of files, but those files sometimes contain a column and sometimes don't, then I have to make sure that when I apply the transformations to those lacking the column, I first remove all the operations that would act on that column. It used to be the case that I could apply all the transformations to all the files and it wouldn't matter, because the operations on absent columns would just be skipped.
I understand that I'm being saved from potentially doing something stupid here to some extent, but it's made certain workflows more painful.
It will be interesting to hear if there are other scenarios where the stricter parsing of the operations JSON is causing an issue, but for this particular case, potentially a "Skip operations on columns that are not present" option (or similar) would allow me to make that decision in cases where I know it's what I want to do.
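In the meantime, one possible stopgap is to pre-filter the extracted JSON before applying it, dropping the operations that target columns the project doesn't have. A minimal sketch, assuming the operations in question store their target column under a "columnName" key (true for common operations like text transforms, but not all operation types), with placeholder filenames and column names:

```python
import json

# Stopgap for a "skip operations on missing columns" option: drop any
# operation whose target column isn't in the current project, then apply
# the filtered history instead. Only handles operations that store their
# target under "columnName"; anything else is kept unchanged.
# The filenames and the column list below are placeholders.
project_columns = {"Title", "Creator", "Date"}

with open("operation-history.json", encoding="utf-8") as f:
    operations = json.load(f)

filtered = [
    op for op in operations
    if "columnName" not in op or op["columnName"] in project_columns
]

with open("operation-history-filtered.json", "w", encoding="utf-8") as f:
    json.dump(filtered, f, indent=2)
```

It's obviously no substitute for a proper option in the Apply dialog, but it keeps the rest of the history usable.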
@Rory I have the exact same issue as @ostephens. I have an order of operations I want to perform on a JSON file to transform it into a working CSV, but because the same columns aren't always present, it has failed since the 3.9 update (error message: "Invalid JSON format: java.lang.Exception: No column named 'xyz'"). My colleague who regularly uses this OpenRefine script to create a working CSV from the JSON file is using an older version of OpenRefine because we haven't come up with an alternative solution. I can share the OR operations script and a sample JSON file we perform them on if that's helpful? I can't upload a .json or a .txt file here, so if you'd like I can email you the files.
I completely agree with Owen and Julia on this. In my case, my organization has complex application profiles for some of our systems, in many cases with hundreds of possible metadata fields. If all metadata fields were used (which never occurs in a single dataset), we could theoretically get a .csv from a submitting repository with a completely unmanageable number of columns; I would make an analogy to requiring a submitter to include all possible Wikidata properties, or every field / subfield in the MARC bibliographic standard, in their spreadsheet, even if they are just using a limited profile of properties / metadata fields within these standards. It would be madness to require partners across my institution to submit .csv files containing hundreds or thousands of columns if they only need to use 20 of them to support their project.

For about a decade, I had happily maintained a set of transformation operations that work with this reality in OpenRefine, but with the strict validation requirements implemented in OpenRefine 3.9+, I can no longer use these JSON operation histories to process metadata, and would be forced to create a new set of JSON operations for each project under the new validation routine. A simple solution would be to allow an option to ignore invalid operations, as Owen suggests. I don't think I can attach a .json file to this message, so I have bundled a sample JSON operation history and a sample .csv with relevant data into a zip file, attached here.
@Rory the biggest change I've noticed is that now (3.9.x) I can't apply a set of operations unless they are all valid.
I wasn't involved in these changes, but that sounds like exactly what one would want to prevent silent data corruption.
It used to be the case that invalid operations were just ignored and the rest of the set was applied, but now the presence of an invalid transform leads the whole set to fail.
The problem with this scenario is that each operation depends on its predecessors, so once one fails, the project is no longer in the assumed state.
The ability to extract and reapply an Undo History was a quick hack that was thrown together to allow a set of operations to be reapplied to the same file or a file of identical shape. Once you deviate from this, you are in uncharted, and more importantly, untested territory. Unfortunately, checks to make sure that these constraints are being followed were missing, along with all the other error checking. The constraints exist only as verbal warnings passed down through the community.
I haven't investigated this yet, but my suspicion is that, if the new error checking is causing you pain, you've probably been unknowingly corrupting your data.
We could probably introduce an option to treat errors as warnings, but removing the error checking altogether and going back to the previous situation seems like a bad idea to me.
As an aside, I'm surprised that automation always rates so low on Martin's surveys if it's being so heavily used.
[...] I have an order of operations I want to perform on a JSON file to transform it into a working CSV, but because the same columns aren't always present, it has failed since the 3.9 update (error message: "Invalid JSON format: java.lang.Exception: No column named 'xyz'").
My understanding is that this is intended to work by presenting you with a column mapping dialog that allows you to map the original column names to the column names in the new file.
Snip: you've probably been unknowingly corrupting your data.
This is definitely not the case for me. We run extensive QC after transformations are run, and the systems that we are loading the data into also run validation.
No, I have never seen this dialogue. I also don't think it would be sufficient for the use cases described, where the requirement is not "apply these to a differently named column" but "skip these operations because that column doesn't exist".
I understand this. I also (and I think I was clear about this in my reply) understand why the changes were introduced and I'm not asking for this checking to be removed.
However, given this is how OpenRefine has operated for as long as I've used it, and I've found it occasionally useful (as others clearly have, based on this thread), I think it's reasonable to ask for some way to recreate the previous behaviour - whether that's via an "override" for particular errors in the Apply history, or by some other mechanism.
I completely support Owen here. I would also push back against the idea that this feature of OpenRefine should be performing some kind of shape or schema validation (against what, exactly?). It is perfectly standard practice when creating a metadata schema to designate some elements as optional / not required. In these instances, datasets that do not contain these elements would still be completely valid. However, IF optional elements (or columns) are present, you would like the relevant operations to be performed.
Nope. I've never seen this dialog. But the issue isn't that the column has a new name; it's that it doesn't exist. The JSON file being generated only includes that metadata field if it exists, and there's no field if it doesn't.
Certainly not. 3.8 still produces a file that correctly performs all the operations on the columns that do exist and ignores the ones that don't. We use these generated CSV files to manually check our authority data, and we most certainly would have discovered corrupted data in the past 3+ years we've been doing this if that were the case. A human is reviewing these files on a regular basis.
@timothy-mendenhall, thank you very much for that sample project and history (coincidentally, I think it also uncovered a UI bug with the column mapping dialog @tfmorris mentioned).
I think it's worth investigating what a resolution to this looks like. The ideal scenario is likely complicated and would take time to properly design, but I think a reasonable first step would be to provide users with a warning that not all operations are valid. Said warning could have "cancel" and "continue" options so users have the option of getting something close to the old behavior while still offering some protection. If that sounds like an acceptable way forward, I can write up an issue to track this.