SQL backend option

One of the recurring gripes I hear in the data science, statistics, and computational biology domains is the lack of seamless workflows for cleaning, analysis, and visualization. Currently OpenRefine doesn’t expose a SQL interface, which makes round-tripping data with tools like scikit-learn, RStudio, KNIME, pandas, Jupyter, and others cumbersome.

I think we could do better post-4.0 and add a Java SQL interface, or make progress towards alternative storage configurations (something like Apache Gora or more current Java DB technology). This would also make future features such as data joins and multi-way merging a reality without us having to build all of that in ourselves. Instead, extensions and external tools could read from and write to whichever DB storage layer is configured.
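To make the idea concrete, here is a rough sketch of what an external tool could do if project data were reachable over SQL. It uses SQLite from the Python standard library purely as a stand-in, not as the proposed backend, and the table names, column names, and helper functions are all hypothetical. The point is only that a join, an aggregate, and a write-back become a few lines of SQL on the tool's side rather than features we have to code into OpenRefine.

```python
import sqlite3

# Assumption: SQLite stands in for whatever SQL-accessible storage layer
# might be configured. Table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (sample_id TEXT PRIMARY KEY, species TEXT);
    CREATE TABLE measurements (sample_id TEXT, value REAL);
    """
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("s1", "E. coli"), ("s2", "B. subtilis")],
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("s1", 0.42), ("s1", 0.58), ("s2", 1.10)],
)


def mean_value_by_species(connection):
    """Join two tables and aggregate, entirely in SQL."""
    return connection.execute(
        """
        SELECT s.species, AVG(m.value)
        FROM samples AS s
        JOIN measurements AS m ON m.sample_id = s.sample_id
        GROUP BY s.species
        ORDER BY s.species
        """
    ).fetchall()


def rename_species(connection, old_name, new_name):
    """Write a cleaned value back through the same interface."""
    cursor = connection.execute(
        "UPDATE samples SET species = ? WHERE species = ?",
        (new_name, old_name),
    )
    connection.commit()
    return cursor.rowcount  # number of rows changed


if __name__ == "__main__":
    print(mean_value_by_species(conn))
    print(rename_species(conn, "E. coli", "Escherichia coli"))
```

The same pattern would apply from R, KNIME, or a notebook: read with a query, clean or analyze, then write back through the same connection.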

One newer project that seems to be used by a few other data cleaning and analytics tools, such as Lilac AI and LanceDB, is DuckDB. It supports fantastic features that are very relevant to how OpenRefine works today and how we want it to work better in the future: parallel vector operations, joins, appends, Parquet, NDJSON, ADBC support, etc.

Also have a look at some of the videos from DuckCon #3.

A particularly clever language is Malloy, developed by Lloyd Tabb, which lends itself well to simpler syntax for nested, joined, and aggregated queries.