Reproducibility project: June report

Here is a summary of what I have been working on over the past month as part of the reproducibility project.

The bulk of the work this month was focused on testing the new architecture on concrete workflows and fixing bugs as I discovered them. And there were a lot of them! Rather than giving an exhaustive list (the commit log is there for that), let me describe the broad areas:

  • a lot of bugs were about the refresh logic of the frontend: avoiding unnecessary updates and ensuring the ones required are actually happening;
  • error handling. This area has been overlooked for so long (be it in the new architecture or earlier) that there was a lot of work to do there (and still a lot of potential for improvement). Now that we have a panel to display details about the long-running process queue, there was also the opportunity to add dedicated error handling there. If a process encounters an error which should interrupt it, that error is now reported there. Potentially, that could be improved to also make it possible to restart the process (if the cause of the error has been addressed). I introduced this primarily for the Wikibase upload operation, which could not really benefit from the crash recovery feature introduced earlier as it required a new login to the Wikibase instance. I also worked on introducing meaningful status codes in the commands exposed by the backend, as well as a catch-all event listener which reports errors to the user when a backend command fails. The hope with this is that we can have more precise bug reports without having to ask people to check the backend logs.
  • memory management improvements. I reintroduced the caching logic I had removed when working on the partial operation result support, which makes a very noticeable difference to the responsiveness of the UI. This is an aspect of the new architecture that I expect will benefit quite a lot from broad testing on various types of workflows, to identify the situations in which it can still get slow.
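
To make the error handling point more concrete, here is a minimal sketch of the idea of a catch-all listener for failed backend commands. It is not the actual code: `CommandClient`, `CommandResponse`, `onError`, and the rule that a status code of 400 or above counts as a failure are all assumptions made for illustration.

```typescript
// Hypothetical shape of a backend command response (not the real API).
interface CommandResponse {
  status: number; // HTTP-like status code set by the backend (assumption)
  body: string;   // payload on success, error message on failure
}

type ErrorListener = (command: string, response: CommandResponse) => void;

// Hypothetical client through which all backend commands are sent.
class CommandClient {
  private listeners: ErrorListener[] = [];

  // Register a catch-all listener, called for every failed command.
  onError(listener: ErrorListener): void {
    this.listeners.push(listener);
  }

  // Run a command through the given transport and notify listeners on failure.
  run(
    command: string,
    transport: (command: string) => CommandResponse
  ): CommandResponse {
    const response = transport(command);
    if (response.status >= 400) {
      for (const listener of this.listeners) {
        listener(command, response);
      }
    }
    return response;
  }
}

// Usage: report any failure to the user instead of only writing it to the backend logs.
const client = new CommandClient();
client.onError((command, response) => {
  console.log(`Command "${command}" failed (${response.status}): ${response.body}`);
});
client.run("upload", () => ({ status: 500, body: "internal error" }));
```

The point of registering the listener in one place is that individual callers do not need to remember to handle failures: any command that returns an error status gets reported.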

Towards the end of the month, the problems I was writing down during testing were increasingly often ones that are also present in 3.x (which I would not fix on the 4.0 branch, but rather open an issue about). Overall, I was able to run some end-to-end workflows without having to stop every 5 minutes to note down an issue to fix later. To me that's a good sign, meaning that this version is ready for broader testing. One of the test workflows was a Wikidata upload to mark some journalists as being part of the ICIJ, based on scraped data.

The discussion on releasing and merging this architecture with the master branch is ongoing, and I am glad to be getting more feedback from the developer team. We will continue the discussion there. Given that the new architecture is a large rewrite of the backend, I think it's easier to review clear documentation of the new structure than to look at diffs, but the current documentation is not good enough for that.

One issue that has been preventing more user testing is the unavailability of snapshot releases due to a problem with Sonatype's repository. We need to migrate to something else to restore the snapshots, because they really do make it easier for a lot of people to try out development versions.

In July I expect to work mostly on documentation and on facilitating the review of the architecture by other developers. In parallel, I want to start doing more user testing and start the recruitment process for the designer.


As an example, here is how an internal error in the Wikibase uploading operation would show up:
image

It's far from perfect, but at least it no longer pretends that the operation is still running when it has already failed.