Hi. I'm trying to figure out an efficient way of reconciling arrays. Reconciling the main entity in a dataset works fine, but I have additional properties I want to reconcile that are stored in arrays/lists. These properties are also names of entities. For example, the main entity could be a company, and it could have an array with a list of names of persons (employees).
To reconcile the arrays, I have used "Edit cells" -> "Split multi-valued cells..." to turn the arrays into rows. This works, but it quickly produces millions of largely duplicate rows, leading to performance issues, since the same entity gets reconciled numerous times.
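To illustrate the blow-up outside OpenRefine: the pandas analogue of that split is `explode`, sketched here with toy data (all names made up):

```python
import pandas as pd

# Toy data: one company per row, each with an array of employee names
df = pd.DataFrame({
    'company': ['Acme', 'Globex'],
    'employees': [['Ann', 'Bob'], ['Bob', 'Cid', 'Ann']],
})

# The pandas analogue of "Split multi-valued cells...": one row per employee.
# With real data this multiplies the row count, and the same name ('Ann',
# 'Bob') now appears in many rows, each of which gets reconciled separately.
exploded = df.explode('employees')
print(exploded)
```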
I have tried to use facets, thinking maybe I could reconcile against the facet, where it would group all the identical names together, but that runs into two problems: 1. the error message "(x choices total, too many to display)", and 2. I haven't found a way to reconcile a facet anyway.
For each array of persons (or whatever), I'm thinking the way forward is perhaps to extract it into a separate dataset, merge the duplicate rows there, and then do the reconciliation. But then I would need to merge that reconciled dataset back into the arrays of the original dataset somehow.
If the original dataset is dataset A, then first reconcile dataset A. Next, split each array from dataset A into new datasets B1, B2, ..., Bn, where each dataset Bx keeps an array of ids from dataset A, to track all entities in A that should have the entity from Bx in an array. Then deduplicate and reconcile each dataset B1, ..., Bn, and finally merge the reconciled datasets back into the arrays in dataset A.
While splitting arrays into separate datasets seems to be possible in OpenRefine, I haven't found a way to do it that isn't a big mess, so it seems much easier to do in code.
If someone has a suggestion for a good way to split arrays into datasets using OpenRefine itself that would be welcome.
This should then be extracted into another dataset, actors, for reconciling the actors. This dataset should contain all unique names of actors and a list of the movies each one is starring in, similar to this. The list of movies should contain an id for each movie, so it can refer correctly back to the original dataset.
name             | starring in
Robert Shaw      | ['Jaws']
Richard Dreyfuss | ['Jaws', 'Nuts']
Dee Wallace      | ['E.T']
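A minimal pandas sketch that produces this inversion (assuming a movies DataFrame with `id`, `title`, and a parsed `actors` list column; all names here are assumptions):

```python
import pandas as pd

# Hypothetical movies dataset with a parsed list column of actor names
movies = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Jaws', 'Nuts', 'E.T'],
    'actors': [['Robert Shaw', 'Richard Dreyfuss'],
               ['Richard Dreyfuss'],
               ['Dee Wallace']],
})

# One row per (movie, actor), then group by actor to collect the movie ids.
# The table above shows titles for readability; ids are what let the result
# refer back to the original dataset.
actors = (movies.explode('actors')
                .groupby('actors')['id']
                .agg(list)
                .reset_index()
                .rename(columns={'actors': 'name', 'id': 'starring in'}))
print(actors)
```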
Following, because I have a similar use-case. @matssk, I don't suppose you'd be willing to share the code you're using for the split and combine? I was planning to work on this next week, but it'd be great to adapt something that already works rather than start from scratch.
Yes - it's a sort of transpose operation I think but not one we currently support. I think the label "Split multi-valued cells into Records" isn't quite right because splitting multi-valued cells always creates records.
Maybe we need an option to "Transpose Records" that swaps the values in one column to be the record key? This would mean the user has to create the records first by splitting the multi-valued cells, but a generic function that starts from a record (which might be created by splitting multi-valued cells, or by some other means) makes more sense to me.
Would the user need to select the Record Key in the dialog prior to performing the transpose, or could we perhaps make smart default assumptions but still allow them to change the Record Key?
Sure Ben. Here's the gist of it. I did it in Python with Pandas. My first attempt in OpenRefine was a mess of doing and undoing stuff in the UI. I'm trying out ostephens' way now to try to get it done more properly in OpenRefine.
```python
import pandas as pd
import ast

df = pd.read_csv('mydata.csv')

# Function to safely convert a string to a list
def safe_literal_eval(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # Return None if the string is malformed
        return None

# Apply the function to the 'some-entity' column
df['some-entity'] = df['some-entity'].apply(safe_literal_eval)

# Initialize an empty dictionary to store the entities and their associated parent IDs
new_dataset = {}

# Iterate through each row in the DataFrame
for index, row in df.iterrows():
    # Get the ID and the list of entities
    row_id = row['id']
    entities = row['some-entity']
    # Only process rows where the cell parsed into a non-empty list
    if entities and isinstance(entities, list):
        for e in entities:
            # Append this row's ID to the entity's list of parent IDs
            new_dataset.setdefault(e, []).append(row_id)

# Now new_dataset maps each entity to the list of parent IDs it appears under
print(new_dataset)
new_df = pd.DataFrame(list(new_dataset.items()), columns=['Entity', 'IDs'])
```
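The gist stops at the split; for the merge-back step, here's a minimal sketch of what I have in mind, assuming the reconciliation has added a `reconciled_id` column to `new_df` (that column name is just a placeholder):

```python
# Sketch: merge reconciled entity IDs back into the arrays of dataset A.
# Assumes new_df gained a 'reconciled_id' column (hypothetical name) from
# reconciling the 'Entity' column.
entity_to_reconciled = dict(zip(new_df['Entity'], new_df['reconciled_id']))

def replace_with_reconciled(entities):
    # Map each raw name to its reconciled ID, keeping the name if unmatched
    if not isinstance(entities, list):
        return entities
    return [entity_to_reconciled.get(e, e) for e in entities]

df['some-entity-reconciled'] = df['some-entity'].apply(replace_with_reconciled)
```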
If there were to be a dedicated function for changing the record key, one big potential advantage would be a dialog where users could specify other multi-valued columns to combine before the transposition, or to split afterwards. Currently a lot of the friction comes from having to do each column separately, adding multiple steps to the undo/redo.
Imagine, for instance, if each movie also had a year, a director, etc. that we wanted to associate with individual actors' records. In my case it's not movies but texts and coauthors, so every record has a title, a publisher, a date, a URL, etc., all of which may be the same beforehand (e.g. allowing a Fill Down operation after splitting the authors) but different once we've changed the key (requiring a Combine multiple values). It can be done, for sure, but it adds up; see the sketch below.
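To make that concrete, here's a rough pandas sketch of the round trip with invented column names: split on one multi-valued column, then re-combine every other column per new key in a single step:

```python
import pandas as pd

# Hypothetical texts dataset: one row per text, multi-valued 'authors' column
texts = pd.DataFrame({
    'title': ['Paper A', 'Paper B'],
    'year': [2001, 2003],
    'publisher': ['OUP', 'MIT Press'],
    'authors': [['Smith', 'Jones'], ['Jones', 'Lee']],
})

# Change the record key from title to author: explode the one column,
# then combine all remaining columns per author in a single groupby.
by_author = (texts.explode('authors')
                  .groupby('authors')
                  .agg(list)
                  .reset_index())
print(by_author)
```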
As a practical matter, it might make sense to insist that each column to combine uses the same delimiter as the original column to split (i.e. actors, in the original example).