Prior work on schema validation?

I just started working on an extension for validating an OpenRefine project against a schema for tabular data (probably the Metadata Vocabulary for Tabular Data). However, I'm wondering whether there is prior work in this space?

Nice that you are interested in this area!

There was once an attempt to provide some integration with the Data Package specification. An importer and exporter were implemented. But because the tool lacks metadata fields for projects and for columns, the integration was not very useful in my opinion: there was no meaningful way to provide richer column metadata by cleaning the dataset in OpenRefine, so the data package export was formally a data package, but in practice not much richer than just a CSV file. Similarly, a lot of the metadata fields of Data Packages were not meaningfully used in OpenRefine after import. The integration was more motivated by the aim to announce compatibility with Data Packages rather than being driven by real use cases.
We had to remove this integration because the library it relied on had a non-free dependency.

Before this integration, there was also some debate about Data Packages vs CSV on the Web, which in my opinion was not very insightful either, given that it was not really grounded in user workflows.

When we introduced the Wikibase (then Wikidata) extension, there was also the question of whether the "Issues" tab could be part of a more general quality assurance system: a uniform way for various components to report issues about the data, to help the process of validating its compliance with some sort of specification (be it the data modelling conventions of the Wikibase instance, or a CSVW metadata object, a SQL schema…). I think @ostephens in particular was keen to think about this given his experience with uses of OpenRefine in the GOKb project (which also developed similar functionality in an extension). Given all those different use cases, I think there would be an opportunity to design something nice! It would feel like a natural feature for a data cleaning tool. But if you prefer working on something narrower that fits your use cases, that's totally understandable and will contribute to giving us a better understanding of what a more general system would need to accommodate.

Thank you @antonin_d!

Before this integration, there was also some debate about Data Packages vs CSV on the Web, which in my opinion was not very insightful either, given that it was not really grounded in user workflows.

This was very interesting. Our reasoning (and needs) are very much governed by our usage of RDF and reconciliation, so CSVW fits the bill very well. Some of the comments on the issue seem to be in the same spirit.

When we introduced the Wikibase (then Wikidata) extension, there was also the question of whether the "Issues" tab could be part of a more general quality assurance system: a uniform way for various components to report issues about the data, to help the process of validating its compliance with some sort of specification (be it the data modelling conventions of the Wikibase instance, or a CSVW metadata object, a SQL schema…). I think @ostephens in particular was keen to think about this given his experience with uses of OpenRefine in the GOKb project (which also developed similar functionality in an extension). Given all those different use cases, I think there would be an opportunity to design something nice! It would feel like a natural feature for a data cleaning tool. But if you prefer working on something narrower that fits your use cases, that's totally understandable and will contribute to giving us a better understanding of what a more general system would need to accommodate.

These thoughts seem very much aligned with ours. We want to create an "Issues" tab, just like the one in the Wikidata extension, where the number of errors is displayed directly in the tab title and updated "live" as the project changes (hence my recent interest in #5335).
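
To make the idea a bit more concrete, here is a rough, standalone Python sketch of what I have in mind. It is not OpenRefine code and does not use any OpenRefine or CSVW library API: `TABLE_SCHEMA`, `Issue`, `validate` and `tab_title` are all hypothetical names, and the schema is only a simplified stand-in for a CSVW `tableSchema` that understands the `name`, `datatype` and `required` column properties. The "live" part is assumed to be nothing more than re-running the validation whenever the rows change.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for a CSVW "tableSchema".
# Only "name", "datatype" and "required" are interpreted in this sketch.
TABLE_SCHEMA = {
    "columns": [
        {"name": "id", "datatype": "integer", "required": True},
        {"name": "label", "datatype": "string", "required": True},
        {"name": "population", "datatype": "integer"},
    ]
}


@dataclass(frozen=True)
class Issue:
    """One reported problem, in a shape any validating component could emit."""
    row: int        # 0-based row index
    column: str     # column name from the schema
    message: str    # human-readable description


def _is_integer(value: str) -> bool:
    # Simplification: Python's int() is more lenient than the CSVW/XSD
    # integer lexical form (it accepts "1_000", for example).
    try:
        int(value)
        return True
    except ValueError:
        return False


def validate(rows, schema=TABLE_SCHEMA):
    """Check every cell against its column description and collect issues."""
    issues = []
    for index, row in enumerate(rows):
        for column in schema["columns"]:
            name = column["name"]
            value = row.get(name) or ""  # treat missing and None as empty
            if value == "":
                if column.get("required", False):
                    issues.append(Issue(index, name, "required value is missing"))
                continue
            if column.get("datatype") == "integer" and not _is_integer(value):
                issues.append(Issue(index, name, f"{value!r} is not an integer"))
    return issues


def tab_title(issues):
    """Title for the hypothetical "Issues" tab, including the issue count."""
    return f"Issues ({len(issues)})"


rows = [
    {"id": "1", "label": "Oslo", "population": "709000"},
    {"id": "x2", "label": "", "population": "n/a"},
]
issues = validate(rows)
print(tab_title(issues))
for issue in issues:
    print(issue.row, issue.column, issue.message)
```

Keeping `Issue` independent of the schema format is deliberate: a Wikibase check or a SQL schema check could report the same kind of object, which seems close to the more general system described above.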

(Having watched the GOKb screencast, I must say that I like the ability to see the issues and the main content at the same time, something a new main tab wouldn't give us.)

In the best of worlds I would store the schema with each project, but given that custom metadata fields seem broken (and the problem isn't obvious to me), I might take a shortcut. If I don't have to, the result should end up being reusable by others.