OpenRefine 2024 Barcamp: Support OpenAPI in OpenRefine

During this 2024 Barcamp session, we discussed the support for OpenAPI in OpenRefine. See the Barcamp page.

This discussion is not about documenting OpenRefine's internal API via OpenAPI and making it public.

What is OpenAPI?

OpenAPI is a standard (grammar, mini-language) for specifying an API in a machine-readable way. It describes the types of requests that can be made to an API and the responses to expect from the server. Examples include:

The Swagger editor can be used to edit this description and generate an interactive app to play with this API. OpenAPI fully describes an API, which can be used for:

  • Autogenerating API documentation
  • Working programmatically with an API

Postman example

Postman is a great example of managing authentication with different APIs:

  • Cookies
  • Environment variables
  • CSRF tokens
  • Chaining API calls (auth first and then query)

@Ayushi_Rai references Bruno as an open-source alternative to Postman. However, there aren't any easy ways to combine Postman/Bruno with OpenRefine so far - it's more a tool to take inspiration from.

Difficulties with the current workflow

Understanding the API definition is essential, as most people use the same set of APIs and will know how to navigate them. OpenAPI can help with:

  • Content negotiation (retrieving JSON instead of XML)
  • Setting a user agent
  • Handling rate limits
  • Managing authentication
  • Crafting a GREL expression to generate the right URL
  • Parsing responses with another GREL expression (e.g., parseJson)

Currently, calling an API in GREL is complex, involving concat and escape to format your URL correctly. This advanced GREL scares new users away. See for example).
Note: The concat/escape issue is addressed by PR #6142 and Issue #6140.

Type of Integration

During the workshop we identified different type of integration with the OpenAPI standards.

  1. During project creation
  2. When we add a column
  3. When doing reconciliation
  4. To publish update to the API endpoint

Project Creation

Provided an OpenAPI Manifest, fetch data from endpoints to create a new project in OpenRefine. This will be a new import option. A single API can cover many different things, so we'd need to specify how to fetch a collection of entities from an OpenAPI profile.

Question: How does the user know what feature/field they want from the API?
See also: Allow users to adjust HTTP headers (Authentication, User-Agent, Accept) when creating projects from Web Addresses (URLs #2104).

Add Column from a REST Request

Using the API definitions we can determine what parameters can be used, allowing them to be filled from data in the project (where appropriate) and validate. @thadguidry: Many APIs, especially in government catalogs, offer OpenAPI endpoints. OpenRefine could parse and understand OpenAPI endpoints, showing non-technical users forms/widgets like Swagger Editor to display available API parameters and allow column mapping to those parameters.

We can support an option to let the user select which data to retrieve from the API, so the one does not have the full JSON in a cell to parse. Then we can automatically parse the response into field with their correct data types/properties.

We can borrow a lot from the add data from reconciliation endpoint workflow. @ostephens: OpenAPI could make it possible to offer an integration that wouldn't be as complete as the Wikidata extension but still easier than the status quo. For smaller platforms with fewer OpenRefine users than Wikibase, it might not be worth developing a similar integration. However, a way to upload from OpenRefine would be useful.

Example of making API requests to an OpenAPI endpoint with OpenRefine: Example Gist.

Wrapping an OpenAPI Endpoint as a Reconciliation Service

Benefits: Users are already familiar with the reconciliation workflows, making it more approachable.
@antonin_d mentionned there are discussions around adding write capabilities in the reconciliation protocol.
@thadguidry offered a quick overview of how upload from GridWorks to Freebase was done.

Wikidata Example

Wikidata is implementing a REST API. A thought experiment would be to replace the current implementation with a more generic OpenAPI implementation.
Wikidata:REST API - Wikidata (not fully ready yet). For the Wikidata-extension, authentication is handled by a library that wraps around the Mediawiki API, making it hard to identify generic learnings from this.

Create or Update Items on a REST Endpoint via an Export Dialog

Authentication is critical for uploading data to an API endpoint. Templates to map fields available in the API with the columns in the OpenRefine project are used when:

  • Posting data back to the API
  • Doing searches with the API and using existing columns as search parameters.

@thadguidry: The Arrazzo Specification group has discussed workflow fashion, feedback loops, download, edit, then upload scenarios, and AI or machine-assisted functions. It might be good to talk with someone or reach out to the mailing list about that future.

Change Detection Capture (CDC)

When pushing data to a REST endpoint or generating an export file, it could be useful if OpenRefine could detect which cells have changed and re-export only those changes.

@thadguidry: This touches on CDC - change data capture. A database backend could be very helpful here, or at least a much-improved granular history.
@ostephens: I do that manually by creating a copy of the original column and adding a facet to compare it to the new column.
@mack-v: Having a work timestamp one each records could address that.
@b2m: Re-export to XML/JSON for projects created from such formats should detect changes and re-export only those changes.

Export to XML/JSON

Part of the discussion explore how user currently use the export to XML/JSON to prepare their JSON before sending it to the API endpoint for update.

@mack-v: There are similar needs around exporting to hierarchical formats (JSON, XML). The templating exporter can do this but it gets very technical quickly. Extensions in Excel could be used as inspiration.
@Michael_Markert: The templating exporter can export to almost anything, but it is very technical.
@ostephens: The templating exporter doesn't respect the records mode (repeat the set of JSON records, not just the rows).
@mack-v: You need to collapse records into rows first, which is very annoying. For example, bibliographic records with multiple subject headings need merging before processing into XML/JSON export using the JSON templating exporter. This doesn't understand records, so parsing into templates like:

{{jsonize(forEach(cells["subject"].value.split("|"), v, "<subject>" + v + "</subject>"))}}

@mack-v: We need a place to collect templates and solutions from user groups that tackle specific problems.
@martin: Do we still need better templating features if we can push data with some generic OpenAPI integration?
@mack-v: For many platforms, the OpenAPI route will not be workable (e.g., contributing to GND by creating MARC files or PICA).

See also:

Some templating examples: