Randomization of records

Dear all

We have a dataset (around 750K records) obtained from the Medline/Pubmed baseline repository and curated in OpenRefine. It is meant for an experiment with a machine learning framework, which takes training data in the following TSV format: column 1 - text corpus and column 2 - MeSH descriptor(s) (URIs separated by a space).

Before using this dataset to train different machine learning backends, we want to arrange the records in random order to avoid any bias from their sequence and to prepare a representative dataset of Medline/Pubmed (they have around 3000K bibliographic records up to 2023).

How can we randomize the order of the rows/records of this dataset in OpenRefine? We are running OpenRefine 3.8.4 with the GoKB extension.

Thanks and regards

-Parthasarathi

Create a new column filled with a random value (use the GREL random() function to do so), then sort on that column and permanently reorder the rows with it. If you use the record link between rows, give every row inside the same record the same random number. The cross() function can help you with that.
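
If it helps to see the idea outside OpenRefine, here is a minimal Python sketch of the same approach: attach one random key to each record, then sort by that key. The function name shuffle_records and the rule "a blank first cell means the row continues the record above" are assumptions made for illustration (the rule mirrors how records mode groups rows). This is not OpenRefine code.

```python
import random


def shuffle_records(rows, seed=None):
    """Return rows in random record order, keeping each record's rows together.

    Assumption: a row whose first cell is blank belongs to the record
    started by the nearest preceding row with a non-blank first cell.
    """
    rng = random.Random(seed)  # a seed makes the shuffle reproducible

    # Group rows into records.
    records = []
    for row in rows:
        starts_record = bool(row) and row[0].strip() != ""
        if starts_record or not records:
            records.append([row])
        else:
            records[-1].append(row)

    # One random key per record, shared by all of its rows.
    keyed = [(rng.random(), index, record) for index, record in enumerate(records)]
    keyed.sort(key=lambda item: (item[0], item[1]))

    # Flatten back to a list of rows.
    return [row for _, _, record in keyed for row in record]


if __name__ == "__main__":
    sample = [
        ["text A", "uri1 uri2"],
        ["", "uri3"],  # continuation of the "text A" record
        ["text B", "uri4"],
        ["text C", "uri5"],
    ]
    for row in shuffle_records(sample, seed=42):
        print("\t".join(row))
```

When every row is its own record, as in a two-column TSV with one text and its descriptors per line, this reduces to a plain shuffle of the rows.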

Regards, Antoine


Thanks. It worked.

Regards
