Data fetching under Cloudflare

psm · September 11, 2023, 10:26am

Dear all

OCLC's Classify service has been behind Cloudflare since last week.

Previously we were able to fetch HTML/XML to determine the most frequent class number for a book from its ISBN, e.g. http://classify.oclc.org/classify2/ClassifyDemo?search-standnum-txt=8179100898&startRec=0

Now it throws an error like this:

org.apache.hc.client5.http.ClientProtocolException: HTTP error 403 : Forbidden for URL http://classify.oclc.org/classify2/ClassifyDemo?search-standnum-txt=978-3-030-84717-3&startRec=0

Any way out?

Regards

antonin_d · September 11, 2023, 10:53am

When I open your link in my browser, it shows me a Cloudflare page with a captcha to solve, which is meant to keep robots out.
Obviously, if the page is intended to work as an API, it's a mistake to put it behind such protection, as it defeats the purpose.
If it's a web interface you are trying to scrape, perhaps you could get away with:

1. First, access the page through your browser, solving any captchas you get.
2. Then run the URL-fetching operation from OpenRefine, setting the same User-Agent as the one used by your browser.

With some luck that User-Agent will be enough, but it may well be that Cloudflare relies on other attributes (such as cookies) to classify users, in which case you'll be blocked again.

You could still try to circumvent that further by doing the URL fetching in Jython, where you'll have full control over the headers you are passing to the web service. You could then replicate exactly what your browser is sending. The web developer tools of most browsers have utilities to copy a given request as a Python or cURL command. Given that Jython uses Python 2.7, you probably cannot reuse those as is, but at least they give you access to all the headers in a simple way. But I'd say at that stage you are reaching the limits of scraping in OpenRefine, and there might be better tools for the job.
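
As a rough illustration of the Jython approach, here is a minimal sketch that sends a request with browser-like headers using only the standard library. The header values are placeholders to be replaced with what your own browser sends, the function names `build_request` and `fetch` are made up for this example, and there is no guarantee Cloudflare will accept the request.

```python
# Sketch only: fetch a URL while sending the same headers as a browser.
# urllib2 is what Jython / Python 2.7 provides; the fallback lets the same
# code run under Python 3 as well.
try:
    import urllib2 as urlrequest          # Jython / Python 2.7
except ImportError:
    import urllib.request as urlrequest   # Python 3

# Placeholder values (assumptions): copy the real ones from your browser's
# developer tools, e.g. via "Copy as cURL" on the request.
BROWSER_HEADERS = {
    "User-Agent": "PASTE-YOUR-BROWSER-USER-AGENT-HERE",
    "Cookie": "PASTE-THE-COOKIE-HEADER-FROM-YOUR-BROWSER-HERE",
}


def build_request(url, headers):
    """Create a request object carrying the given headers (nothing is sent yet)."""
    return urlrequest.Request(url, headers=headers)


def fetch(url, headers, timeout=30):
    """Send the request and return the raw response body."""
    response = urlrequest.urlopen(build_request(url, headers), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

In an OpenRefine Jython expression, where `value` holds the URL of the current cell, the last line would be something like `return fetch(value, BROWSER_HEADERS)`. A 403 response will surface as an `HTTPError`, which would mean the copied headers were not enough.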

ostephens · September 11, 2023, 5:30pm

Based on the information on this page, it looks like data that was previously available via a non-secured API is now restricted and requires a key: http://classify.oclc.org/classify2/api_docs/index.html
