One of the recurring gripes I hear about in the data science, statistics, and computational biology domains is the lack of seamless workflows for cleaning, analysis, and visualization. Currently OpenRefine doesn't expose a SQL interface, which makes round-tripping with tools like scikit-learn, RStudio, KNIME, pandas, Jupyter, and others cumbersome.
I think we could do better post-4.0 and add a Java SQL interface, or make progress towards alternative storage configurations (something like Apache Gora or a more current Java DB technology). This would also make future features such as data joins and multi-way merging a reality without us having to write much of the code ourselves: extensions and tools could instead read from and write to whatever DB storage layer is configured.
One newer project already being used by other data cleaning and analytics software, such as Lilac AI and LanceDB, is DuckDB. It supports fantastic features that are very relevant to how OpenRefine works and how we want it to work better in the future: parallel vectorized operations, joins, appends, Parquet, NDJSON, ADBC support, etc.
Also have a look at some of the videos from DuckCon #3
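To make that concrete, here is a minimal sketch (not OpenRefine code) of how DuckDB's Java JDBC driver can query a Parquet file in place; the file name is just a placeholder, and you'd need the duckdb_jdbc artifact on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DuckDbSketch {
    public static void main(String[] args) throws Exception {
        // "jdbc:duckdb:" opens an in-memory database; append a file path to persist.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement();
             // DuckDB can scan Parquet files in place; no import step is needed.
             ResultSet rs = stmt.executeQuery(
                     "SELECT count(*) AS n FROM 'project-export.parquet'")) {
            while (rs.next()) {
                System.out.println("row count = " + rs.getLong("n"));
            }
        }
    }
}
```

Nothing gets imported first: DuckDB scans the Parquet file directly, which is exactly the kind of friction-free round-tripping with external tools that I was complaining about above.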
A particularly clever language is Malloy, which lends itself well to a simpler syntax for nested, joined, and aggregated queries; it is developed by Lloyd Tabb.
Looking at the rapidly changing landscape of existing RDBMS technologies, it appears that graph querying and traversal is being added to more and more of them. Apache AGE is a PostgreSQL extension that adds graph querying.
There would be nifty things one could do with Record relations and data quality when a graph model is overlaid.
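As a hedged sketch of what that overlay might look like (none of this exists in OpenRefine today, and the graph name, node label, and edge type below are made up for illustration), AGE lets you run Cypher through plain PostgreSQL JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AgeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/refine", "refine", "secret");
             Statement stmt = conn.createStatement()) {
            // AGE requires loading the extension and putting ag_catalog on the path.
            stmt.execute("LOAD 'age'");
            stmt.execute("SET search_path = ag_catalog, \"$user\", public");
            // Hypothetical graph: rows as :Record nodes, duplicates as :SAME_AS edges.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM cypher('refine_graph', $$ " +
                    "MATCH (a:Record)-[:SAME_AS]->(b:Record) RETURN a.id, b.id " +
                    "$$) AS (a_id agtype, b_id agtype)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " <-> " + rs.getString(2));
                }
            }
        }
    }
}
```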
I would love to see a proof of concept where PostgreSQL could be an optional storage and query layer, maybe through the 4.0+ architecture, perhaps as a storage engine extension itself that other extensions might then depend upon.
Is that a possibility if given enough time and effort?
@antonin_d I started helping contribute to and sponsor an effort that I thought would make things easier in multiple ways on this issue. I would love to know what you think about perhaps just leveraging middleware instead, what else that middleware might need, and how OpenRefine could take advantage of it. I'd be keen to know what else we might need here for easily writing recon services that use an existing DB backend, as well as what OpenRefine using the middleware would look like and what more DB2Rest could do to make that happen.
Blogged about DB2Rest here:
Expose your database as a REST API safely and without code - DEV Community
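To illustrate the middleware idea, here is a toy recon-style lookup over HTTP. The endpoint path and filter parameter below are assumptions for illustration, not DB2Rest's documented API (consult its docs for the real URL shape); the point is only that a recon service shrinks to an HTTP call plus JSON mapping:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ReconViaMiddlewareSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String name = URLEncoder.encode("Douglas Adams", StandardCharsets.UTF_8);
        // Hypothetical: a table exposed as a REST resource with a filter parameter.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/rdbms/mydb/authors"
                        + "?filter=name==" + name))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // A recon service would map this JSON onto reconciliation candidates.
        System.out.println(response.body());
    }
}
```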
I've also reached out to Luca of ArcadeDB to get his input (he's a senior Java dev as well) on how he thinks this might help "bridge the middle gap and safely" in a lot of ways.
Note: Dhrubo is one of my best friends and we worked together at Ericsson back in the day with Tony (who did OpenRefine's database extension).
I also think that data virtualization itself (a priori Data Presentation) might ultimately be where we really want to head.
Maybe we don't even need a SQL backend per se, but only a data management framework that doesn't care where data is stored or even what format it is in!
For instance, Apache Calcite is essentially a DBMS without worrying about the D, the data: it doesn't store anything itself. OpenRefine users could generate their own virtual joined views (lattices) with ease if we embedded Calcite as an extension (with 4.0 we would probably need to write our own small adapter for Calcite), and then our facets (or new joining facets) could be used for grouping and for selecting the schema that generates the joins, etc. There is a minimal embedding sketch after the link below.
Lattices (apache.org)
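This sketch follows Calcite's own tutorial pattern: plain Java objects registered through the reflective adapter become queryable tables. An OpenRefine adapter would expose project data instead; the Project and Employee classes here are invented for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class CalciteSketch {
    // Public array fields of a plain object become tables under ReflectiveSchema.
    public static class Project {
        public final Employee[] emps = {
            new Employee(1, "Alice"), new Employee(2, "Bob")
        };
    }
    public static class Employee {
        public final int id;
        public final String name;
        Employee(int id, String name) { this.id = id; this.name = name; }
    }

    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        info.setProperty("lex", "JAVA");
        try (Connection conn = DriverManager.getConnection("jdbc:calcite:", info)) {
            CalciteConnection calcite = conn.unwrap(CalciteConnection.class);
            SchemaPlus root = calcite.getRootSchema();
            // Register an in-memory "schema"; a real adapter could instead
            // point at CSV files, a 4.0 project store, a remote DB, etc.
            root.add("project", new ReflectiveSchema(new Project()));
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "select name from project.emps where id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```

The appeal is that Calcite plans and executes the query while the data stays wherever it already lives, which is the whole data virtualization pitch.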
@steve did you know about Calcite?
I have now come full circle, to the point where I don't think it makes sense for OpenRefine to have a SQL backend option at all. Let me wax eloquent.
SQL itself is NOT the best language to use when you need to express complex multi-cell/table joins or filters effectively.
I've been reading and following the research (papers and code) of 蒋步星 (Jiang Buxing), who demonstrates that there are better ways of expressing this (as has Lloyd Tabb with Malloy). Over the past 10 years he has also built esProc SPL, with other authors, to prove out some of his research, and it's rather nicely done, I would say. One notable aspect is the use of low-level Java working on bytes and arrays directly, pushing operations up to higher levels (similar to @antonin_d's work and research).
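To illustrate that style (a toy example of mine, not esProc's actual code): a filter-plus-aggregate over primitive column arrays touches only the bytes it needs, with no row objects allocated at all.

```java
public class ColumnarFilterSketch {
    public static void main(String[] args) {
        // Each column is a single primitive array rather than fields on row objects.
        int[] year = {2019, 2020, 2021, 2021, 2022};
        double[] amount = {10.0, 20.0, 30.0, 40.0, 50.0};

        double total = 0;
        for (int i = 0; i < year.length; i++) {
            if (year[i] == 2021) {        // the predicate scans one column
                total += amount[i];       // only matching rows touch the other column
            }
        }
        System.out.println("2021 total = " + total);  // prints 70.0
    }
}
```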
When it comes to joins, I rather like how SPL handles them, as thoroughly described at:
and
as well as in other posts describing its architecture and first thoughts.
Good areas of code to review are under:
- esProc/src/main/java/com/scudata/common at master · SPLWare/esProc (github.com)
- esProc/src/main/java/com/scudata/dm at master · SPLWare/esProc (github.com)
But there are other areas as well.
@steve might also take a deeper look into its different way of approaching joins, and even its way of comparing Records or handling CellSets.