A new start for the 4.0 architecture

Hi all!

It’s November already! That means the WMF-sponsored project to develop integration with Wikimedia Commons is over. I was initially not supposed to be actively developing for it, but things turned out differently, so I did not have much time to work on the new architecture. Now I am happy to be back at work on this big project.

The 4.0 branch has not really picked up much activity so far - mostly because I want to improve it further before pushing it out to the public. So I want to give an overview here of what I plan to work on in the coming weeks and months.

  • Catching up with updates on the master branch: I have been merging master into 4.0 and I should be done soon. Such a merge is a major undertaking because the branches have diverged significantly by now. It was further complicated by the recent re-linting of the Java code base, and I learned a few lessons from this task:
    • I am all the more motivated to extract the Wikibase extension into its own repository, and probably others too, such as the database and Google Data extensions. A lot of development activity happened in those extensions, so not having to migrate those commits to the new architecture would be a bonus. Of course that migration will also have to be done eventually, but more modularity means that we do not have to do all of those migrations at once.
    • Tests! They are really invaluable for this activity. When reviewing hundreds of merge conflicts (more precisely, more than 600 files), it is easy to make the wrong decision about a conflict. Tests provide redundancy: if a code change came with a test, it is unlikely that both the code change and the new test get lost in the merge. I had a few failing tests which helped me correct the merge.
    • It is worth investing in tooling before undertaking such a merge: tweaking the merge settings and observing the sorts of conflicts we get. I would also be interested in the possibility of reviewing the results of git’s rename detection and fixing them manually. Due to namespace changes and restructuring of the repository, a lot of files have moved, and sometimes git gets this wrong (either failing to detect that one file is a rename of another, or mistakenly conflating two unrelated files). I have not found a way to review those rename decisions myself - if you know of a workaround, I would be interested to hear about it.
  • Developing some quick prototypes of appealing new features in the 4.0 branch. For now, from a user perspective, the 4.0 branch only provides different memory management. That can already be useful in some cases, but it is not a killer feature which will encourage people to migrate. Also, extension maintainers will need to invest quite a lot of effort to migrate their extensions to this architecture, and this will not happen unless users are really enthusiastic about the new version. For now I am thinking about the following:
    • Partial evaluation of long-running operations: seeing the first few cells of the project reconciled after a few seconds, without having to wait for the entire project to be reconciled. The goals would be to give people immediate feedback about the settings of their long-running operation (letting you realize that you should have run the reconciliation with different settings), to let people run multiple long-running operations at the same time (reconciling two different columns simultaneously), to allow manually reviewing and matching cells while reconciliation is running, and so on. From my experience of working on Wikidata imports or scraping projects (with the URL-fetching operation), this would make a really big difference.
    • Native command line interface to run OpenRefine workflows. There is already openrefine-client, a Python library which also provides a command-line client for OpenRefine, but I would be interested to see if the modularization of our code into different Maven modules (which is already present in 4.0) could be used to build a CLI in Java, which would directly depend on the appropriate Maven artifacts published by OpenRefine. This would have the advantage of providing a CLI which does not depend on running OpenRefine alongside it - the CLI would be crunching the data itself. While a CLI will only appeal to a small proportion of our user base (since it is for the more tech-savvy people), I see this as a useful experiment to validate the modularization of our code and a necessary utility to test the Spark integration properly. Running OpenRefine on Spark with the web UI is possible, but it is unlikely people really want to do that. I am thinking about experimenting with such a native CLI in an external repository (in the interest of keeping the scope of the main repository narrow).
    • Implementing some strategic long-standing feature requests like the infamous issue #33, which can make a big difference in some workflows. Issue #33 is a good candidate because the new architecture in 4.0 comes with changes to paging which should make this easier. But I am also considering implementing things which are only tangentially related to the new architecture - just for the sake of having juicy new stuff in 4.0.
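To make the partial-evaluation idea more concrete, here is a minimal, self-contained Java sketch - not OpenRefine code, all class and method names are invented for illustration. A worker thread publishes each "reconciled" cell as soon as it is computed, so a reader can take a consistent snapshot of the first results while the operation is still running:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Toy sketch of partial evaluation of a long-running operation.
 * A worker thread publishes each "reconciled" cell as soon as it is
 * ready, so a UI could display the first rows of the project without
 * waiting for the whole column to finish.
 */
public class PartialEvaluationSketch {

    // Results published incrementally; readers always see a consistent prefix.
    private final List<String> partialResults = new CopyOnWriteArrayList<>();

    // Stand-in for a slow per-cell operation such as reconciliation.
    static String reconcile(String cell) {
        return cell.toUpperCase() + "@wikidata";
    }

    /** Starts the operation in the background and returns immediately. */
    public Thread start(List<String> column) {
        Thread worker = new Thread(() -> {
            for (String cell : column) {
                partialResults.add(reconcile(cell)); // publish right away
            }
        });
        worker.start();
        return worker;
    }

    /** A consistent view of whatever has been computed so far. */
    public List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(partialResults));
    }

    /** Convenience: run to completion and return all results. */
    public List<String> runBlocking(List<String> column) {
        try {
            start(column).join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return snapshot();
    }

    public static void main(String[] args) {
        PartialEvaluationSketch op = new PartialEvaluationSketch();
        // A real UI would poll snapshot() while the worker runs; here we
        // just wait for completion and print the final state.
        System.out.println(op.runBlocking(List.of("douglas adams", "ada lovelace")));
    }
}
```

The real thing is of course much harder (operations apply to a shared grid, not an isolated list), but the shape of the problem - a background computation exposing consistent intermediate snapshots - is the same.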
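And here is the kind of thing I have in mind for the CLI, as a deliberately simplified Java sketch. All names here are made up, and the two inline operations stand in for what would really be replayed from a saved OpenRefine operation history via the published Maven artifacts. The point is that the CLI crunches the data itself, with no OpenRefine server running alongside:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical headless CLI step: read rows, apply a workflow, write
 * the result. In a real CLI the workflow would be loaded from a saved
 * OpenRefine operation history and executed by the relevant Maven
 * modules; here two hard-coded operations stand in for it.
 */
public class RefineCliSketch {

    // Stand-in for replaying a saved operation history over the rows.
    static List<String> applyWorkflow(List<String> rows) {
        return rows.stream()
                .map(String::trim)             // e.g. a "trim whitespace" operation
                .filter(row -> !row.isEmpty()) // e.g. a "remove blank rows" operation
                .collect(Collectors.toList());
    }

    // Usage sketch: java RefineCliSketch input.csv output.csv
    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Path.of(args[0]));
        Files.write(Path.of(args[1]), applyWorkflow(rows));
    }
}
```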

Above all I want to be able to share prototypes of those projects early on, and iterate with community feedback. The migration to the new architecture was problematic in this respect: there was so much refactoring (all operations, importers, exporters, facets, and more) that this created a big barrier, a big chunk of work that had to be done before a functional tool could be tried out.

And hopefully this forum becomes a lively place where such exchanges can happen!
