Prior work on schema validation?

abbe98 · June 26, 2023, 8:34am

I just started working on an extension for validating an OpenRefine project against a schema for tabular data (probably Metadata Vocabulary for Tabular Data). However, I'm wondering whether there is prior work in this space.

antonin_d · June 26, 2023, 11:07am

Nice that you are interested in this area!

There was once an attempt to provide some integration with the Data Package specification. An importer and exporter were implemented. But because the tool lacks metadata fields for projects and for columns, the integration was not very useful in my opinion: there was no meaningful way to provide richer column metadata by cleaning the dataset in OpenRefine, so the data package export was formally a data package, but in practice not much richer than just a CSV file. Similarly, a lot of the metadata fields of Data Packages were not meaningfully used in OpenRefine after import. The integration was more motivated by the aim to announce compatibility with Data Packages rather than being driven by real use cases.
We had to remove this integration because the library it relied on had a non-free dependency.

Before this integration, there was also some debate about Data Packages vs CSV on the Web, which was not super insightful in my opinion, given that it was not really grounded in user workflows either.

When we introduced the Wikibase (then Wikidata) extension, there was also the question of whether the "Issues" tab could be part of a more general quality assurance system: a uniform way for various components to report issues about the data, to help the process of validating its compliance with some sort of specification (be it the data modelling conventions of the Wikibase instance, or a CSVW metadata object, a SQL schema…). I think @ostephens in particular was keen to think about this given his experience with uses of OpenRefine in the GOKb project (which also developed similar functionality in an extension). Given all those different use cases, I think there would be an opportunity to design something nice! It would feel like a natural feature for a data cleaning tool. But if you prefer working on something narrower that fits your use cases, that's totally understandable, and it will contribute to giving us a better understanding of what a more general system would need to accommodate.
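
To make the "uniform way to report issues" idea a bit more concrete, here is a minimal sketch in Python. It is not OpenRefine code and does not reflect how the Wikibase extension is implemented; the names `Issue`, `required_column` and `collect_issues` are hypothetical and only illustrate how independent checkers could feed one shared issue list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Issue:
    """One reported problem (hypothetical structure, for illustration only)."""
    source: str            # component that raised the issue, e.g. "csvw"
    row: int               # zero-based row index
    column: Optional[str]  # affected column, or None for row-level issues
    message: str


Row = dict[str, str]
# A checker inspects one row and yields zero or more issues.
Checker = Callable[[int, Row], Iterable[Issue]]


def required_column(name: str) -> Checker:
    """Build a checker that flags blank cells in the given column."""
    def check(index: int, row: Row) -> Iterable[Issue]:
        if not row.get(name, "").strip():
            yield Issue("required", index, name, f"missing value in '{name}'")
    return check


def collect_issues(rows: list[Row], checkers: list[Checker]) -> list[Issue]:
    """Run every checker on every row and gather the issues in one list."""
    issues: list[Issue] = []
    for index, row in enumerate(rows):
        for checker in checkers:
            issues.extend(checker(index, row))
    return issues
```

Each specification (Wikibase conventions, CSVW metadata, a SQL schema) would contribute its own checkers, and a single "Issues" view could then display `len(issues)` and the details regardless of where they came from.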

abbe98 · June 26, 2023, 11:45am

Thank you @antonin_d!

Before this integration, there was also some debate about Data Packages vs CSV on the Web, which was not super insightful in my opinion, given that it was not really grounded in user workflows either.

This was very interesting. Our reasoning (and needs) are very much governed by our usage of RDF and reconciliation, so CSVW fits the bill very well. Some of the comments on the issue seem to be in the same spirit.
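
As a rough illustration of what validating rows against CSVW metadata could look like, here is a small Python sketch. It only reads the `tableSchema.columns` entries with their `name`, `datatype` and `required` properties and handles three datatypes; the `PARSERS` table and the `validate` function are my own hypothetical helpers, and Python's parsers only approximate the lexical rules of the CSVW datatypes. A real implementation would need to cover far more of the specification.

```python
import json
from datetime import date

# A tiny CSVW-style metadata object (example data, not from a real project).
METADATA = json.loads("""
{
  "tableSchema": {
    "columns": [
      {"name": "id", "datatype": "integer", "required": true},
      {"name": "born", "datatype": "date"}
    ]
  }
}
""")

# Approximate parsers for a small subset of CSVW datatypes (assumption).
PARSERS = {"integer": int, "date": date.fromisoformat, "string": str}


def validate(rows, metadata):
    """Return (row index, column name, message) tuples for invalid cells."""
    errors = []
    for column in metadata["tableSchema"]["columns"]:
        name = column["name"]
        datatype = column.get("datatype", "string")
        parse = PARSERS.get(datatype, str)  # unknown datatypes are not checked
        for index, row in enumerate(rows):
            value = row.get(name, "")
            if value == "":
                if column.get("required", False):
                    errors.append((index, name, "required value is missing"))
                continue
            try:
                parse(value)
            except ValueError:
                errors.append((index, name, f"not a valid {datatype}"))
    return errors
```

Calling `validate(rows, METADATA)` on a list of dictionaries (one per row) gives a flat list of problems, which is also the kind of count an "Issues" tab title could show.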

When we introduced the Wikibase (then Wikidata) extension, there was also the question of whether the "Issues" tab could be part of a more general quality assurance system: a uniform way for various components to report issues about the data, to help the process of validating its compliance with some sort of specification (be it the data modelling conventions of the Wikibase instance, or a CSVW metadata object, a SQL schema…). I think @ostephens in particular was keen to think about this given his experience with uses of OpenRefine in the GOKb project (which also developed similar functionality in an extension). Given all those different use cases, I think there would be an opportunity to design something nice! It would feel like a natural feature for a data cleaning tool. But if you prefer working on something narrower that fits your use cases, that's totally understandable, and it will contribute to giving us a better understanding of what a more general system would need to accommodate.

These thoughts seem very much aligned with ours. We want to create an "Issues" tab, just like the one in the Wikidata extension, where the number of errors is displayed directly in the tab title and updated "live" as the project changes (hence my recent interest in #5335).

(Having watched the GOKb screencast, I must say that I like the ability to see the issues and the main content at the same time, something a new main tab wouldn't help us with.)

In the best of worlds I would store the schema with each project, but given that custom metadata fields seem broken (and the issue isn't obvious to me), I might take a shortcut. Otherwise, it should end up being reusable by others.

thadguidry · March 2, 2025, 12:37pm

One of the things I've been watching in the metadata / schema modeling space is LinkML, which has generators for multiple programming languages such as Java. What's interesting is that it's framework, database, and serialization agnostic. Its ecosystem is also interesting.
https://linkml.io/linkml/intro/overview.html

LinkML is a flexible modeling language that allows you to author schemas in YAML that describe the structure of your data. Additionally, it is a framework for working with and validating data in a variety of formats (JSON, RDF, TSV), with generators for compiling LinkML schemas to other frameworks.
