Reconciling arrays

Hi. I'm trying to figure out an efficient way of reconciling arrays. Reconciling the main entity in a dataset works fine, but I have additional properties that I want to reconcile that are stored in arrays/lists. These properties are also names of entities. Example can the main entity could be a company and it could have an array with a list of names of persons (employees).

To reconcile the arrays, I have used "Edit cells" -> "Split multi-valued cells..." to turn the arrays into rows. This works, but this quickly turns into millions of largely duplicate rows, leading to performance issues, where it tries to reconcile the same entity numerous times.

I have tried to use facets to reconcile, thinking maybe I could reconcile against the facet, where it would group all the identical names together, but that leads to 1. The error message (x choices total, too many to display), 2. I haven't found a way to reconcile the facet.

For each array of persons or whatever, I'm thinking the way is perhaps to extract this to a separate dataset where I can then merge the duplicate rows and then do the reconciliation, but then I would need to merge that reconciled dataset into the array of the original dataset somehow.

I figured out how to do it.

If the original dataset is dataset A, then first reconcile dataset A. Next split each array from dataset A into new datasets B1, B2, ..., Bn. Where each dataset Bx has an array of ids from dataset A, to keep track of all entities in A that should have the entity Bx in an array. Then deduplicate and reconcile each dataset B1,...,Bn and finally merge data reconciled datasets back into arrays in dataset A.

It seems that while splitting arrays into separate datasets in OpenRefine is possible, I haven't been able to find a way to do it that's not a big mess and not very sensible, so it seems much easier to do in code.

If someone has a suggestion for a good way to split arrays into datasets using OpenRefine itself that would be welcome.

An example:


title      |    starring
Jaws     |   ['Roy Scheider','Robert Shaw','Richard Dreyfuss','Lorraine Gary','Murray Hamilton']
E.T       |   ['Dee Wallace','Henry Thomas','Peter Coyote','Robert MacNaughton','Drew Barrymore']
Nuts     |   ['Barbra Streisand','Richard Dreyfuss','Robert Webber','James Whitmore','Karl Malden']
#---------

This should then be extracted into another dataset actors for reconciling the actors. This dataset should be contain all unique names of actors and a list of movies they are starring in, similar to this. The list of movies they are starring in should contain an id for the movie, so it can refer correctly to the original dataset.


name                      |   starring in
Robert Shaw           |   ['Jaws']
Richard Dreyfuss   |   ['Jaws', 'Nuts']
Dee Wallace           |    ['E.T']
#---------

There are definitely a few ways of approaching this. One possibility:

  1. Split the 'starring' column into individual cells using "Split multi-valued cells"
  2. Use "Fill down" the 'title' column so each actor name has the file title with it
  3. Sort by the 'starring' column so that the actor names are in order and apply the sort permanently
  4. Move the 'starring' column to the start of the project so it is the first column
  5. Use "Blank down" the 'starring' column so each name appears only once
  6. Switch to "record" mode and then use "Join multi-valued cells" on the 'title' column
2 Likes

Following, because I have a similar use-case. @matssk, I don't suppose you'd be willing to share the code you're using for the split and combine? I was planning to work on this next week, but it'd be great to adapt something that already works rather than start from scratch.

Owen, do you think It would be nice to have an option for "Split multi-valued cells into Records", that would just do those steps?

Yes - it's a sort of transpose operation I think but not one we currently support. I think the label "Split multi-valued cells into Records" isn't quite right because splitting multi-valued cells always creates records.

Maybe we need an option to Transpose Records to swap values in one col to be the record key? This would mean the user had to create the records first by splitting the multi-valued cells but as a generic function starting from a record (which might be created through splitting multi-valued cells, but might be created through some other means) makes more sense to me

Would the user need to select the Record Key in the dialog prior to performing the transpose, or could we perhaps make smart default assumptions but still allow them to change the Record Key?

They'd have to choose which column was used as the record key - similar to (some of) the existing Transpose dialogs

Actually - maybe not. The default at least, and maybe only, choice should be the column they are applying the transformation from

Sure Ben. Here's the gist of it. I did it in Python with Pandas. What I first did in OpenRefine was a lot of mess and doing and undoing stuff in the UI. I'm trying out ostephens way now to try to get it done more proper in OpenRefine.

import pandas as pd
import ast
df = pd.read_csv('mydata.csv')

# Function to safely convert a string to a list
def safe_literal_eval(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # Return None or some default value if the string is malformed
        return None

# Apply the function to the 'some-entity' column
df['some-entity'] = df['some-entity'].apply(safe_literal_eval)
# Initialize an empty dictionary to store the entities and their associated parent IDs
new_dataset = {}

# Iterate through each row in the DataFrame
for index, row in df.iterrows():
    # Get the ID and the list of entities
    id = row['id']
    entitites = row['some-entity']

    # Check if entitites is a list and not None or empty
    if entitites and isinstance(entitites, list):
        for e in entitites:
            # Add the ID to the entities list in the dictionary
            if e in new_dataset:
                new_dataset[e].append(id)
            else:
                new_dataset[e] = [id]

# Now, new_dataset contains each entity and a list of IDs they worked on
print(new_dataset)
new_df = pd.DataFrame(list(new_dataset.items()), columns=['Entity', 'IDs'])

If there were to be a dedicated function for changing the record key, one big potential advantage would be a dialog where users could specify other multi-valued columns to combine before the transposition or split afterwards. Currently a lot of the friction is having to do each column separately, adding multiple steps to the undo/redo.

Imagine, for instance, if each movie also had a year, a director, etc that we wanted to associate with individual actors' records. In my case, it's not movies but texts and coauthors, so every record has a title, a publisher, a date, a URL, etc etc all of which may be the same beforehand (eg. allowing a Fill Down operation after splitting the authors) but different once we've changed the key (requiring a Combine multiple values). It can be done, for sure, but it adds up.

As a practical matter it might make sense to insist that each column to combine will use the same delimiter as the original column to split (ie actors, in the original example).

2 Likes