Concurrency of long-running operations

I have worked on this aspect and have a better architecture to propose now.
I have convinced myself that all of the columnar metadata associated with an operation (representing which columns it reads, writes, deletes or copies) should be enforced by the core platform directly.
This means that if an operation declares that it depends only on certain columns, it can only read data from those columns (which was already the case, as mentioned above), but also that it can only write data to the columns it declares as outputs. A corollary is that operations no longer have to manually mark the columns they modify, as this is handled by the core tool instead.
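
To make this concrete, here is a minimal sketch of how the core could enforce declared reads and writes, assuming rows are modelled as plain ordered maps. None of these names are actual OpenRefine APIs; they only illustrate the idea of projecting rows before the operation runs and merging its output back afterwards:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// A row is modelled here as an ordered map from column name to cell value.
final class EnforcedRowMap implements UnaryOperator<Map<String, Object>> {
    private final UnaryOperator<Map<String, Object>> operationMap; // the operation's own row map
    private final List<String> dependencies; // columns the operation may read
    private final List<String> outputs;      // columns the operation may write

    EnforcedRowMap(UnaryOperator<Map<String, Object>> operationMap,
                   List<String> dependencies, List<String> outputs) {
        this.operationMap = operationMap;
        this.dependencies = dependencies;
        this.outputs = outputs;
    }

    @Override
    public Map<String, Object> apply(Map<String, Object> fullRow) {
        // Only expose the declared dependency columns to the operation...
        Map<String, Object> restricted = new LinkedHashMap<>();
        for (String column : dependencies) {
            restricted.put(column, fullRow.get(column));
        }
        Map<String, Object> produced = operationMap.apply(restricted);
        // ...and only copy back the declared output columns into the full row.
        Map<String, Object> result = new LinkedHashMap<>(fullRow);
        for (String column : outputs) {
            result.put(column, produced.get(column));
        }
        return result;
    }
}
```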

Going for such an approach (instead of trusting the operations to do what they say) should have multiple benefits:

  • avoid bugs: a mismatch between what an operation declares and what it does could mean that it does not execute as expected, for instance if it is allowed to run in parallel with another operation when it should have waited for that operation's results instead. This is particularly important because extensions are able to define new operations, whose correctness we have no control over.
  • make it easier to implement columnar reading of project data later on. Say we want to read a selection of the columns from a project's current state (for instance the columns that are present in the view, or the columns that are required for facet computation). The project's current state is derived from the initial state by a series of operations. Assuming all of those are row/record wise and declare the columns they depend on and touch, then we are able to compute the total set of columns that should be read to compute all the required operations to generate the required set of columns.
  • make it possible to optimize the computation of rows/records with a pipelining approach. At the moment, when we apply a new row/record-wise operation that is computed lazily (such as trimming whitespace or making judgments on reconciled cells), this adds another map operation on top of the existing grid, which will then be executed every time we need to access any data in the project grid. As the user keeps applying more operations, this stack of map operations grows, which can slow down access to the project data. This slowdown can be mitigated by caching an intermediate grid in memory or on disk, but that also comes at a cost. So the idea behind "pipelining" is to merge many consecutive map operations into just one, which computes the composite function (see the sketch after this list). If operations have clear columnar dependencies, then we have an opportunity to optimize this composite function to make it run faster. This is a standard technique used in platforms like Spark, and one we could try implementing if the need/interest arises. But for now I would say it would be premature optimization.
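
To illustrate the pipelining idea from the last point, here is a minimal Java sketch of merging consecutive lazy maps into one composite function. It only illustrates the technique, not OpenRefine code; rows are stood in for by strings, purely to keep it short:

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Merging a stack of lazily-applied maps into one composite function, so that
// each row access pays for a single pass instead of one pass per operation.
final class PipeliningSketch {
    static <T> UnaryOperator<T> compose(List<UnaryOperator<T>> maps) {
        return value -> {
            T current = value;
            for (UnaryOperator<T> map : maps) {
                current = map.apply(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        UnaryOperator<String> trim = String::trim;
        UnaryOperator<String> lowercase = String::toLowerCase;
        // One composite pass instead of two stacked lazy maps:
        UnaryOperator<String> composite = compose(List.of(trim, lowercase));
        System.out.println(composite.apply("  Hello World  ")); // prints "hello world"
    }
}
```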

How to represent the columnar dependencies of OpenRefine operations?

Here is an overview of how I propose to model those. First, as is currently the case, all row/record-wise operations come with an engineConfig, which specifies whether the operation is run in rows or records mode, as well as the facets which restrict its execution to a subset of the dataset. On top of that, I am introducing new metadata fields:

Column dependencies

By declaring a list of column names as column dependencies, the operation will be run on rows or records which contain only those columns, in the specified order.

An operation can opt out of this by not declaring a list of column names, in which case it is still fed full rows and can potentially read information from any column. Keeping this escape hatch is useful even when an operation does not actually depend on all columns: we might simply not be able to determine its dependencies statically.
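
As a sketch, the declaration could take a shape along these lines (hypothetical names, not the actual interface):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical shape of the declaration (not OpenRefine's actual interface):
// an empty Optional means the operation is opaque and must be fed full rows.
interface ColumnDependenciesSketch {
    Optional<List<String>> getColumnDependencies();
}

// A whitespace-trimming operation would depend only on the column it edits:
class TrimColumnSketch implements ColumnDependenciesSketch {
    private final String columnName;

    TrimColumnSketch(String columnName) {
        this.columnName = columnName;
    }

    @Override
    public Optional<List<String>> getColumnDependencies() {
        return Optional.of(List.of(columnName));
    }
}
```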

Column insertions / replacements

By declaring a list of column insertions or replacements, an operation is able to isolate the columns that it modifies. Each of those comes with:

  • the name of the column
  • a boolean flag indicating whether the column is inserted or replaces another column
  • the name of the column to the right of which the column should be inserted, or if it is a replacement, the name of the column to replace
  • if the column is directly obtained by copying another column, the name of that column (otherwise, the column is computed by the operation)
  • an optional reconciliation configuration to store in the column metadata

This rather long list of fields is there to make it possible to express most of OpenRefine's current row/record-wise operations as a series of such insertions (see the examples below).
When an operation declares insertions / replacements, it is expected to produce rows which only contain those columns (excluding those which are directly copied from other columns). The core of the tool takes care of inserting those values into the full rows by itself.
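
For illustration, the fields above could be modelled with a hypothetical Java record along these lines (the field names are mine, and the recon configuration type is a placeholder):

```java
import java.util.Optional;

// Placeholder standing in for OpenRefine's reconciliation configuration type.
record ReconConfigPlaceholder(String serviceUrl) {}

// One possible way to model the insertion/replacement metadata described above.
record ColumnInsertion(
        String name,                  // name of the inserted or replacing column
        boolean replace,              // true if it replaces an existing column
        String insertAfterOrReplace,  // column to insert after or, for a replacement, the column replaced
        Optional<String> copiedFrom,  // set when the column is a plain copy of another column
        Optional<ReconConfigPlaceholder> reconConfig // recon metadata to store on the column, if any
) {}
```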

Note the choice to determine the positions of the inserted columns based on the name of the column they should be inserted after, instead of using column indices. I think this should make for better reproducibility, given how the majority of OpenRefine operations work. Say you came up with a series of operations which breaks down a date represented in a particular format into three columns (day / month / year), inserted after the original column. If you want to reuse this series of operations on another dataset, the date column to operate on will likely not be at the same position in the table. Therefore it's better to define the position of the new columns relative to the original column, rather than using fixed indices. This seems to be a pattern followed by most OpenRefine operations: "add column based on this column", "split column", "add column based on reconciled values" all create new columns after the column they are run on.

Column deletions

In addition to declaring insertions and replacements, an operation can declare a set of column names that it deletes.

Examples

  • Moving a column can be expressed as (see the sketch after this list):

    • dependencies: none
    • insertions/replacements: a single insertion, to put a copy of the column at the new location
    • deletions: the original column
  • Reconciling a column can be expressed as:

    • dependencies: the column being reconciled and any other columns that are used in the reconciliation config
    • insertions/replacements: a single replacement, to replace the unreconciled column with the reconciled one
    • deletions: none
  • Adding column based on this column can be expressed as:

    • dependencies: declared if the expression can be analyzed to extract the set of columns it reads. For now, this is only implemented for GREL; Jython/Clojure expressions are always treated as opaque, which means the operation is considered to read all columns.
    • insertions/replacements: a single insertion, after the column it is based on
    • deletions: none
  • Reordering columns:
    The operation currently stores the final positions of the retained columns, which cannot be expressed as a series of insertions and deletions; but with small changes to how the operation is defined, we should be able to do so (see the corresponding issue)

  • Data extension:
    It is not a row-wise operation (because it can create additional rows to form a record structure when multi-valued properties are present), so it falls outside this framework and is therefore treated as an opaque operation (which can read/write arbitrary columns).
    Although the ability to declare column dependencies and insertions is available to operations running in records mode, it currently only covers "row-wise transformations applied in records mode", meaning that the number of rows is preserved, even though the operation is able to read information from the record containing the row being processed. Record-wise operations in the sense of "mapping one record to another (of possibly different length)" cannot yet be analyzed in this fashion. I am not sure if/how to add support for that.
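
To make the first example concrete, here is how the declarations for moving a column could look, reusing the hypothetical ColumnInsertion record sketched earlier (again, illustrative names only):

```java
import java.util.List;
import java.util.Optional;

// Declarations for the "moving a column" example above.
final class ColumnMoveSketch {
    // A pure move reads no cell values, so it declares no dependencies.
    static List<String> dependencies() {
        return List.of();
    }

    // A single insertion which copies the original column to its new location.
    static List<ColumnInsertion> insertions(String column, String insertAfter) {
        return List.of(new ColumnInsertion(
                column, false, insertAfter, Optional.of(column), Optional.empty()));
    }

    // The original column is deleted.
    static List<String> deletions(String column) {
        return List.of(column);
    }
}
```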

Overview of the current operation hierarchy

The Java classes which implement the Operation interface currently form the hierarchy below (with the "Operation" suffix removed, for readability).
The analysis of columnar dependencies is only available to the operations which extend the RowMapOperation base class. For each of those, I indicate whether I was able to make it declare its dependencies and the columns it writes to, or explain why not.

All other operations are still treated as opaque. Some of those could potentially also be analyzed if the model is made flexible enough to accommodate them.

  • AnnotateOneRow: like the two following ones, it can actually be formulated as a row map, but the operation currently returns additional metadata to enable in-place re-rendering, which is what has prevented me from reclassifying these operations so far. I think it should be doable though.
  • CellEdit: see above
  • ReconEdit: see above
  • EngineDependent: this is the base class for all operations which respect the row/records mode toggle and the facets
    • BlankDown: this is not a map, because state is carried across consecutive rows/records. But the columnar scope is very clear, so it would be nice to be able to expose it.
    • FillDown: same as above
    • ColumnSplit: the number of columns created depends on the data, so strictly speaking this is not a map. But the columnar action is also clear, so it would be nice to be able to expose that.
    • ExtendData: not captured yet, see the explanation above
    • PerformWikibaseEdits: not captured yet, because there is some non-locality in the way it updates cells reconciled to 'new' with the created items
    • ReconCopyAcrossColumns: it's such a weird operation that we should re-design or drop it
    • RowMap: the base class for all operations which can expose a columnar scope
      • ColumnMove: ✓columnar scope exposed
      • ColumnRemoval: ✓columnar scope exposed
      • ColumnRename: ✓columnar scope exposed
      • ExpressionBased: base class for all operations which rely on evaluating an expression. The isolation of column dependencies is only supported for a subset of GREL expressions
        • ColumnAdditionByFetchingURLs: ✓columnar scope exposed
        • ColumnAddition: ✓columnar scope exposed
        • MassEdit: ✓columnar scope exposed
        • TextTransform: ✓columnar scope exposed
      • ReconClearSimilarCells: ✓columnar scope exposed
      • ReconDiscardJudgments: ✓columnar scope exposed
      • ReconJudgeSimilarCells: ✓columnar scope exposed
      • ReconMarkNewTopics: ✓columnar scope exposed
      • ReconMatchBestCandidates: ✓columnar scope exposed
      • ReconMatchSpecificTopic: ✓columnar scope exposed
      • ReconUseValuesAsIdentifiers: ✓columnar scope exposed
      • Recon: ✓columnar scope exposed
      • RowFlag: not really exposed, because flags and stars don't form actual columns so far (see this post)
      • RowStar: same as above
    • RowRemoval: this is not a row map, but it would still be interesting to expose the facet dependencies for this.
  • KeyValueColumnize: this will remain opaque
  • TransposeColumnsIntoRows: this will remain opaque
  • TransposeRowsIntoColumns: this will remain opaque
  • MultiValuedCellJoin: this could be analyzed if we are able to handle record maps
  • MultiValuedCellSplit: same as above
  • RowReorder: this is not a row map, but it would still be interesting to expose the columnar dependencies
  • SaveWikibaseSchema: the columnar dependencies of the schema could potentially also be exposed

As you can see, we already capture a good chunk of the existing operations, including many highly-used ones, but there is still room for more.

Discussion

The model proposed above is an attempt to capture a large part of OpenRefine's operations and render their actions analyzable within this model. It's designed with the following use cases in mind:

  • extract and reapply history (so that we can determine what a given series of operations requires of the grid it is applied to, and give the user a chance to rename column dependencies if there is a mismatch)
  • concurrent operations (to be able to run operations concurrently if they are not interdependent)
  • visualization of workflows (offering a graph-based representation of the operations and their dependencies, to better understand the structure of the history)
  • deleting / reordering operations in the history
  • the internal optimizations mentioned above (only affecting performance)

It's by no means a definitive model: I hope to expand it to be able to analyze more operations and represent them in a way that makes them generalize correctly in most cases. But maybe you already see some problems with the choices made here.