As mentioned in my monthly report, I have started working on adding support for concurrent long-running operations for the 4.0 version.
Here is a short video demonstrating one use case:
Why work on this now?
The tasks coming up in this project rely on one crucial architectural decision: how to represent the columnar dependencies of operations at the core of the tool. We need a representation which enables a wide range of use cases, so it is not easy to get right. A waterfall-like development process would consist of first designing this new metadata format, then migrating all the operations to expose that information, and finally implementing all the use cases it enables. But of course that runs the risk that the chosen representation of the dependencies does not quite address the desired use cases, forcing us to migrate a lot of operations again later on to expose more metadata.
Instead, I am trying another approach: experiment with introducing just enough metadata for one use case at a time, and see where it takes me. The additional metadata I am introducing is always optional, because we need to cater for the needs of special operations whose dependencies cannot be isolated clearly (such as the transpose operations). This means that most operations can be left as is in this first prototyping phase: they are just treated as black boxes by the execution engine. I am therefore concentrating my testing around a sample of operations.
Multiple types of concurrency
I have identified two types of concurrency worth implementing in the tool:
- a row-wise concurrency, where the rows are streamed through the pipeline of operations. This is a form of concurrency because the operations run simultaneously: the first row of the table can make it through the entire pipeline before the 100th row has even started to be processed by the first operation. This sort of streaming is implemented in a lot of ETL tools, like Knime or Apache NiFi. When the records mode is used, this becomes a record-wise concurrency.
- a column-wise concurrency, where operations which work on different columns are executed independently. If you fetch URLs from one column and reconcile another, the two processes should be able to carry on in parallel.
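To make the column-wise case concrete, here is a minimal sketch in plain Java (not the actual OpenRefine API) of two long-running operations on disjoint columns running concurrently; fetchUrlsInColumn and reconcileColumn are hypothetical stand-ins:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ColumnWiseConcurrencyDemo {

    // Hypothetical stand-ins for long-running operations on single columns.
    static void fetchUrlsInColumn(String column) { /* fetch URLs and store the responses */ }
    static void reconcileColumn(String column) { /* match cells against a reconciliation service */ }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // The two operations touch disjoint columns, so neither needs to wait for the other.
        CompletableFuture<Void> fetching = CompletableFuture.runAsync(
                () -> fetchUrlsInColumn("url"), pool);
        CompletableFuture<Void> reconciling = CompletableFuture.runAsync(
                () -> reconcileColumn("author"), pool);

        // We only join at the very end, once both columns have been processed.
        CompletableFuture.allOf(fetching, reconciling).join();
        pool.shutdown();
    }
}
```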
In my opinion, both types of concurrency would be useful in OpenRefine. The row-wise (or record-wise) concurrency is useful on large datasets: it essentially makes it possible to compose your data cleaning workflow by looking at a sample of the data and to let the execution follow on the rest of the dataset. But with this concurrency alone, the operations you execute towards the end of the pipeline can get held up by any slow operation earlier on, even though its outcome is not actually needed to execute the later operations. This can lead to unnecessary waits that a columnar analysis can avoid.
For now I have been focusing on the column-wise concurrency, because it relates to those wider reproducibility use cases and is therefore helpful to identify the representation of column dependencies that is needed. But I do plan to implement row/record-wise concurrency as well.
Overview of the dependency representation so far
I have modified the column metadata to record, for each column:
- at which step in the history the column was last modified (lastModified)
- what name the column had at this step (originalName)
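Roughly, this boils down to something like the following (a simplified sketch: the real class carries more information and the names are approximate):

```java
/**
 * Simplified sketch of the per-column metadata described above.
 * The actual class carries more fields and the names are approximate.
 */
public record ColumnMetadata(
        long lastModified,   // position of the history entry which last modified this column
        String originalName  // name the column had at that step
) { }
```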
With those simple ingredients, when starting a long-running operation (that is to say an operation which must persist the data it fetches or computes), we can do the following (roughly):
- analyze which columns are read by that long-running operation
- for each of those columns, read from the column metadata the pair (lastModified, originalName)
- find the earliest position in the history where all those (lastModified, originalName) pairs are present in the column model
- run the operation from that position (instead of the current one) as soon as the grid at that position is fully computed
When looking for an earlier state in the history from which to compute the operation, we also ensure that the row / record ids are preserved by the operations browsed. This is done thanks to the existing row / record preservation metadata (introduced in the first phase of the project).
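Put together, the search for a suitable starting position could look roughly like this (a sketch only: HistoryEntryState and its methods are placeholders, reusing the ColumnMetadata sketch above):

```java
import java.util.List;
import java.util.Set;

public class OperationScheduler {

    /**
     * Placeholder for a position in the project history, exposing the column
     * metadata of the grid at that point and whether the change applied at this
     * step preserved row / record ids (from the existing preservation metadata).
     */
    interface HistoryEntryState {
        Set<ColumnMetadata> columns();
        boolean changePreservesRowAndRecordIds();
    }

    /**
     * Find the earliest history position from which the operation can run:
     * all the (lastModified, originalName) pairs it reads must already be
     * present there, and every later change up to the current position must
     * preserve row / record ids.
     */
    static int earliestStartPosition(List<HistoryEntryState> history,
                                     Set<ColumnMetadata> readColumns) {
        int earliest = history.size() - 1; // the current position is always a valid fallback
        for (int position = earliest; position > 0; position--) {
            HistoryEntryState previous = history.get(position - 1);
            // stop as soon as a change does not preserve ids or the required
            // columns are not yet present in the earlier grid
            if (!history.get(position).changePreservesRowAndRecordIds()
                    || !previous.columns().containsAll(readColumns)) {
                break;
            }
            earliest = position - 1;
        }
        return earliest;
    }
}
```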
For now, it is the responsibility of any transformation deriving a new Grid to update those fields in the new grid. I anticipate that there should be better ways to ensure they get updated by the architecture itself. At least for now, if an operation declares that it depends only on a certain set of columns, it is only given access to those columns during the computation, so that is already one useful guarantee. I think it should be possible to do something similar the other way around, for the modified columns.
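To illustrate what that could look like, here is a hypothetical sketch where the engine (rather than the operation) is in charge of both restricting the view to the declared columns and recording the modified columns; all type and method names here are placeholders, not the actual 4.0 API:

```java
import java.util.Set;

public class DependencyAwareRunner {

    // Placeholders for the real grid and operation interfaces.
    interface Grid {
        Grid restrictedTo(Set<String> columnNames);                   // project to a subset of columns
        Grid merge(Grid changedColumns);                              // overlay changed columns on this grid
        Grid withLastModified(Set<String> columnNames, long entryId); // bump the lastModified metadata
    }

    interface ColumnarOperation {
        Set<String> readColumns();     // declared dependencies
        Set<String> writtenColumns();  // declared outputs
        Grid apply(Grid restrictedView);
    }

    static Grid run(ColumnarOperation op, Grid state, long historyEntryId) {
        // the operation only sees the columns it declared reading
        Grid view = state.restrictedTo(op.readColumns());
        Grid changed = op.apply(view);
        // the engine, not the operation, merges the result back and records
        // which columns were modified at this history entry
        return state.merge(changed)
                    .withLastModified(op.writtenColumns(), historyEntryId);
    }
}
```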
Columnar reading of project data
Looking at the turn things are taking, it seems increasingly easy to support columnar reading of the project grid, by which I mean only reading from disk the columns that are needed for a particular computation. This could potentially even apply to displaying the grid, only reading the columns which are visible on screen. I would not be surprised if the requirements for such a feature ended up closely aligned with the reproducibility requirements. But because this is only a performance optimization, which does not seem crucial for most of our current users, it is not something I plan to invest a lot of energy into for now. I just have it on my radar and will keep it in mind when making architectural changes.