Hello! I'm trying to pull data into an OpenRefine project from a CSV export on an open data portal. The URL to the CSV redirects to a signed S3 URL with expiring credentials, but OpenRefine doesn't appear to be following the redirect when I use the "Import from Web Addresses" option.
Here's the URL: https://data.boston.gov/dataset/e63a37e1-be79-4722-89e6-9e7e2a3da6d1/resource/73c7e069-701f-4910-986d-b950f46c91a1/download/tmp3fykdojs.csv
... which redirects to a URL like this, with credentials that expire after 24 hours:
https://s3.amazonaws.com/og-production-open-data-bostonma-892364687672/resources/73c7e069-701f-4910-986d-b950f46c91a1/tmp3fykdojs.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJIENTAPKHZMIPXQ%2F20250318%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250318T184820Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=3851cad29b5bda96d901f33f4d8c0d2f6c44839c70dc8c29c9569e29239bdfcc
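As an aside, the 24-hour lifetime can be read straight off the query string: X-Amz-Date is when the URL was signed and X-Amz-Expires is its lifetime in seconds. Here is a rough Python sketch of that arithmetic. The function name `presigned_url_expiry` is just something I made up for illustration, and it assumes both parameters are present in the URL.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, parse_qs


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC time at which a presigned S3 URL stops working.

    Hypothetical helper: assumes the URL carries X-Amz-Date and
    X-Amz-Expires query parameters, as the one above does.
    """
    params = parse_qs(urlsplit(url).query)
    # X-Amz-Date is the signing time in UTC, e.g. 20250318T184820Z.
    signed_at = datetime.strptime(params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity window in seconds (86400 = 24 hours).
    lifetime = timedelta(seconds=int(params["X-Amz-Expires"][0]))
    return signed_at + lifetime
```

Given the X-Amz-Date and X-Amz-Expires values in the URL above, this works out to 18:48:20 UTC on 19 March 2025, after which the link would need to be fetched again from the portal.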
Is there any way around this problem other than setting up a routine to download the file locally each time?
Thank you! I appreciate all the work everyone has put into this tool. I use it all the time.
TL;DR: I can't get this to work, and there is some odd behaviour. It may be worth raising a GitHub issue so that a developer can dig more deeply.
My understanding is that "Import from Web Addresses" should follow the redirect automatically. Testing the URLs with cURL, I get the redirect from the initial URL without any problem, but if I then cURL the resulting S3 URL I get a 403 response. That doesn't tell us a huge amount: it means we're not authorised to access that URL, but not why. It could be a TLS/SSL issue, or maybe something else.
On the other hand, it works fine in the browser.
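For anyone who wants to repeat the first part of that check without cURL, here is a rough Python sketch that requests a URL once, without following redirects, and reports the status code and the Location header. The name `first_hop` and the User-Agent string are my own placeholders, and this is only an approximation of what cURL does, not what OpenRefine does internally.

```python
import http.client
from urllib.parse import urlsplit


def first_hop(url: str, timeout: float = 10.0):
    """Request `url` once, without following redirects.

    Returns (status_code, location_header_or_None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # GET rather than HEAD: some servers answer the two differently.
        # The User-Agent value is an arbitrary placeholder.
        conn.request("GET", path, headers={"User-Agent": "redirect-check/0.1"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `first_hop(...)` on the data.boston.gov URL should show whether a redirect status and an S3 Location come back, and calling it again on that Location shows what the second hop returns.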
Trying in OpenRefine, I can't get either URL to work using "Import from web addresses". In both cases I just get the error "Error uploading data null", which is unhelpful.
If I use "Add column by fetching URLs" from inside an existing OpenRefine project, the S3 URL works fine, but the original data.boston.gov URL gives me an error:
org.apache.hc.client5.http.ClientProtocolException: HTTP error 403 : Forbidden for URL https://data.boston.gov/dataset/e63a37e1-be79-4722-89e6-9e7e2a3da6d1/resource/73c7e069-701f-4910-986d-b950f46c91a1/download/tmp3fykdojs.csv
Again, it's a 403 response, which could cover a multitude of sins. It seems odd to get that for a URL that should be openly available and is just a redirect.
That's as far as I could easily get, I'm afraid, so I don't have a solution for you.
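If you do end up falling back to a local download routine, something along these lines might be a starting point. It is only a sketch, untested against the Boston portal itself: `download_csv` and the User-Agent string are placeholders of mine, and it relies on Python's `urllib` following redirects by default, so the freshly signed S3 URL is picked up on every run.

```python
import shutil
import urllib.request


def download_csv(url: str, dest: str, timeout: float = 30.0) -> str:
    """Fetch `url`, following any redirects, and save the body to `dest`.

    Returns the final URL the data was actually served from.
    """
    # The User-Agent value is an arbitrary placeholder.
    req = urllib.request.Request(url, headers={"User-Agent": "csv-fetch/0.1"})
    # urlopen follows 301/302/303/307 redirects by default.
    with urllib.request.urlopen(req, timeout=timeout) as resp, \
            open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
        return resp.geturl()
```

The saved file could then be loaded into OpenRefine from disk, or the script scheduled to run before each import.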
Owen
Thanks for taking a look! I think the 403 error is because the URL there is unsigned. I'll open a GitHub issue.