Concurrency of long-running operations

Here is a quick demo of what the user experience could look like for row-wise concurrency:

Beyond the scenario of reconciliation followed by data extension presented in the video, I think something like this could be useful in a fairly wide range of scenarios. For instance, the URL-fetching operation is generally very slow if you choose a long wait time between requests, to be polite to the web server. Typically, the HTTP response is then processed further, for instance via GREL/Python or reconciliation (see for instance this tutorial). This feature would let the user experiment with those downstream processing steps without having to wait for the URL-fetching operation to complete.
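As a concrete illustration of such a downstream step, here is a rough sketch of the kind of Jython/Python expression one could run on the fetched column (for instance when deriving a new column from it), assuming the responses are JSON and contain a "title" field; both the format and the field name are purely illustrative:

    import json

    # 'value' is the content of the fetched cell; it can still be empty
    # if the fetch for this row has not completed yet.
    if value is None or value == "":
        return None
    return json.loads(value).get("title")

With concurrent execution, an expression like this could already be tried out on the rows that have been fetched so far, while the rest of the column is still being filled in.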

After experimenting with it a little, I have found that the current restriction to row-wise operations limits the reach of the feature: records-mode support would be really useful. For instance, the data extension operation always runs in records mode, because if the reconciliation service returns multiple values then a record structure is created. I would keep this task on the back burner for now: I think it is worth validating the basic principle with users before investing more time in such a feature, which is not central to the reproducibility topic anyway (it is already an extension of column-based concurrency).

Still, I am quite enthusiastic about the potential effects of such a feature. Intuitively, it should:

  • make working with slow reconciliation services much less frustrating, because the time it takes for reconciliation to complete is no longer such a big deal. The same goes for scraping or for working with restrictive APIs.

  • reduce the need to artificially restrict long-running operations to subsets of the dataset to make them tractable. At the moment, I see this done very often. The problem is that once you have carried out many other operations after your original costly operation, there isn't really a good way to relax the original filter and run the whole workflow on a larger part of the dataset:

    • you can fiddle with the JSON representation of the workflow manually (which is pretty error-prone), and even then you would lose all the results of the long-running operation on the subset of the dataset where it has already run
    • you can re-run the same steps manually on other parts of the dataset, but for any operation that creates new columns (URL fetching, data extension) you will have to manually merge the newly created columns into the ones created when you first ran the operation on the subset
    • maybe other approaches?

    Instead, with this feature you can just run the operation on the entire dataset, maybe pause it if you want to spare the servers you are hitting, do your downstream processing (evaluating it on the rows that have already been fetched) and later simply resume the original operation to process the rest of the dataset, along with all of the downstream operations.

  • bring us closer to executing entire workflows in a streaming fashion, from the import stage to the export. This would be particularly interesting in a CLI environment, where you would be able to do something like this (with a purely fictional and non-vetted syntax): cat source_data.csv | refine my_workflow.json > cleaned_data.tsv
    assuming the file my_workflow.json encodes the import metadata, the series of operations and the export format. If all the operations are row- or record-wise, the entire processing could be done in streaming mode.
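
    To make the streaming idea a bit more concrete, here is a minimal, purely illustrative Python sketch (not OpenRefine code; the script name, column name and transform are made up) of why row- or record-wise operations compose into a stream: each row is read, transformed and written out without ever holding the full dataset in memory.

        import csv
        import sys

        def transform(row):
            # Stand-in for a chain of row-wise operations
            # (e.g. trimming a cell, deriving a new column from an existing one).
            row["name"] = row.get("name", "").strip()
            row["name_length"] = str(len(row["name"]))
            return row

        reader = csv.DictReader(sys.stdin)
        writer = csv.DictWriter(sys.stdout,
                                fieldnames=reader.fieldnames + ["name_length"],
                                delimiter="\t")
        writer.writeheader()

        for row in reader:                   # rows are processed one at a time,
            writer.writerow(transform(row))  # so the whole file never has to fit in memory

    Such a script (say, streaming_sketch.py) could be invoked in the same spirit as the command above: cat source_data.csv | python streaming_sketch.py > cleaned_data.tsv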
