Reproducibility project: May progress report

Hi all,

Here is a summary of what I have been working on in May for the reproducibility project. Overall, my goals were twofold:

  • stabilize the code base to aim for a 4.0-alpha2 release, ironing out the bugs I am aware of;
  • in parallel, continue designing the architecture for columnar operations (detecting the column dependencies of operations, so that we can visualize them, run operations in parallel when possible, and adapt workflows to grids with mismatching column names).

It was perhaps not the best idea to work on those two things in parallel because they are pretty different in nature and switching between the two is not so easy. It felt a bit disorganized. Lesson learned!

For the stabilization of the code base, I worked on all sorts of small bugs that I discovered by testing the tool interactively on various datasets. A lot of them are related to the introduction of partially-computed grids. I have been trying to iron out the user experience there, and there are still a couple of things I want to introduce to make it (in my opinion) really smooth and natural.

In that vein, one thing I worked on is making columns resizable (#4806). I am not sure it was a great idea to work on this, because it does feel like a new feature and not just polishing things up. The reason I felt it was in scope is that displaying partially-computed grids requires updating them regularly, and doing so can be disruptive if the table layout changes too much when content gets added to cells. Think about the workflow of creating a new column by fetching URLs: the contents returned by the HTTP requests will often be quite large, so they will likely expand the columns and rows as they come in. By letting the user set column widths themselves, we move to a layout where column width is fixed, which avoids re-flowing a lot of content whenever a big cell is inserted. Together with a cap on cell height (#1440), this should keep layout changes within an acceptable range.

Anyway, even though I am not sure it was the right time to work on this, I am pretty happy with the result. Beyond making it possible to resize columns, it also has a big impact on the initial layout of a grid, and on all the datasets I tried it is, in my opinion, a very clear improvement. There is much less wasted whitespace, especially for columns with a long title and short values.

On the columnar operations front, I sat down to think about possible architectures. I reached the conclusion that the features we want to deliver can be implemented with relatively lightweight changes on top of the existing architecture, just by adding a bit of metadata to operations. One could imagine pushing this columnar architecture deeper into the computation engine, so that only the required columns are read from disk. That would require storing project data in a columnar format such as Parquet, and it would likely help speed up facet computation, for instance. But implementing something like this seems pretty invasive: I have not found a way to make it work nicely with the existing Runner / Grid / ChangeData interfaces.
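
To make this more concrete, here is a minimal sketch of what such metadata could look like, assuming we simply extend the Operation interface shown below. The ColumnAwareOperation name and the getColumnDependencies method are my own illustration of the idea, not a settled API:

import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: exposing column dependencies as operation metadata.
// The interface and method names are illustrative, not a final API.
public interface ColumnAwareOperation extends Operation {

    /**
     * The set of column names this operation reads, or {@link Optional#empty()}
     * when the dependencies cannot be determined statically (for instance when
     * an expression accesses cells via dynamically-computed column names).
     * Knowing this set makes it possible to visualize dependencies, run
     * independent operations in parallel, and remap a workflow onto a grid
     * with different column names.
     */
    Optional<Set<String>> getColumnDependencies();
}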

While thinking about this, I was often confronted with the question of whether a new piece of metadata or a new method should live at the Operation level or at the Change level. After a while I realized that those two levels were in fact redundant (#5856), so I went on to merge them. Although it was a fairly tedious refactoring, I think it was really worth it, because it reduced the complexity of the code base quite a lot. The Operation interface now looks like this:

/**
 * An operation represents one step in a cleaning workflow in Refine. It applies to a single project via the
 * {@link #apply(Grid, ChangeContext)} method. The result of this method is then stored in the
 * {@link org.openrefine.history.History} by an {@link org.openrefine.history.HistoryEntry}.
 * 
 * Operations only store the metadata for the transformation step. They are required to be serializable and
 * deserializable in JSON with Jackson, and the corresponding JSON object is shown in the JSON export of a workflow.
 * Therefore, the JSON serialization is expected to be stable and deserialization should be backwards-compatible.
 */
@JsonTypeInfo(use = JsonTypeInfo.Id.CUSTOM, include = JsonTypeInfo.As.PROPERTY, property = "op", visible = true)
@JsonTypeIdResolver(OperationResolver.class)
public interface Operation {

    /**
     * Derives the new grid state from the current grid state. Executing this method should be quick (even on large
     * datasets) since it is expected to just derive the new grid from the existing one without actually executing any
     * expensive computation. Long-running computations should rather go in the derivation of a {@link ChangeData} which
     * will be fetched asynchronously.
     * 
     * @param projectState
     *            the state of the grid before the change
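     * @param context
     *            the context in which the operation is applied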
     * @return an object which bundles up various pieces of information produced by the operation: primarily, the new
     *         grid after applying the operation. This object can be subclassed to expose more information, which should
     *         be serializable with Jackson so that it reaches the frontend.
     * @throws OperationException
     *             when the change cannot be applied to the given grid
     */
    public ChangeResult apply(Grid projectState, ChangeContext context) throws OperationException;

    /**
     * A short human-readable description of what this operation does.
     */
    @JsonProperty("description")
    public String getDescription();

    /**
     * Could this operation be meaningfully re-applied to another project, or is it too specific to the data in this
     * project? Operations which affect a single row or cell designated by a row index should return false, indicating
     * that they are small fixes that should likely not be part of a reusable pipeline.
     */
    @JsonIgnore // this can be derived from operation metadata itself
    public default boolean isReproducible() {
        return true;
    }

    @JsonIgnore // the operation id is already added as "op" by the JsonTypeInfo annotation
    public default String getOperationId() {
        return OperationRegistry.s_opClassToName.get(this.getClass());
    }
}

I would likely still rename the ChangeResult and ChangeContext classes so that they say "Operation" rather than "Change", but still, I hope people can appreciate the simplicity of the whole thing. Implementing an operation now just amounts to providing a function that applies the operation to a grid, plus a bit of metadata (description, reproducibility). It's really a lot simpler than before, when an Operation had to create a Process, which itself generated a Change, which had to be able not just to apply the operation but also to undo it.
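
To illustrate, here is a minimal sketch of what an operation could look like against this interface. The class itself, the Grid.removeColumn helper and the single-argument ChangeResult constructor are assumptions I am making for the sake of the example, not the actual API:

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// Hypothetical example: an operation that removes a single column.
// Grid.removeColumn and the ChangeResult constructor are assumed helpers.
public class RemoveColumnOperation implements Operation {

    private final String columnName;

    @JsonCreator
    public RemoveColumnOperation(@JsonProperty("columnName") String columnName) {
        this.columnName = columnName;
    }

    // serialized into the JSON export of the workflow
    @JsonProperty("columnName")
    public String getColumnName() {
        return columnName;
    }

    @Override
    public ChangeResult apply(Grid projectState, ChangeContext context) throws OperationException {
        // cheap, lazy derivation of the new grid; no expensive computation here
        return new ChangeResult(projectState.removeColumn(columnName));
    }

    @Override
    public String getDescription() {
        return "Remove column " + columnName;
    }
}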

Another thing to celebrate this month is the extraction of the Spark integration into an extension developed in a separate repository (#4396). It makes the main repository lighter and confirms that the core is now properly decoupled from Spark.

In June, I want to re-focus on getting 4.0-alpha2 out the door. It is difficult for me to judge what quality bar it should meet, i.e. which fixes I can leave for later, but by now I have the feeling that I should lower my quality standards and just publish something. It's really only meant to be a preview we can send to people to get their opinion on design choices.
