Data fetching under Cloudflare

Dear all

OCLC's Classify service is now Cloudflare-enabled (since last week).

Previously we were able to fetch HTML/XML to determine the most frequent class number for a book from its ISBN, e.g. -

Now it throws an error like this:

org.apache.hc.client5.http.ClientProtocolException: HTTP error 403 : Forbidden for URL

Any way out?


When I open your link in my browser, I get a Cloudflare page with a captcha to solve, meant to keep robots out.
Obviously, if the page is intended to work as an API, it's a mistake to put it behind such protection, as that defeats the purpose.
If it's a web interface you are trying to scrape, perhaps you could get away with:

  • first access the page through your browser, solving any captchas you get
  • second, run the URL-fetching operation from OpenRefine, setting the same User-Agent as the one your browser sends

With some luck that User-Agent will be enough, but it may well be that Cloudflare relies on other attributes to classify users (such as cookies), in which case you'll be blocked again.

You could still try to circumvent that further by doing the URL fetching in Jython, where you have full control over the headers you pass to the web service. You could then replicate exactly what your browser is sending. Most browsers' web developer tools can copy a given request as a Python or cURL command. Since Jython implements Python 2.7, you probably cannot reuse those as is, but they at least give you access to all the headers in a simple way. That said, at that stage you are reaching the limits of scraping in OpenRefine; there might be better tools for the job.
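To make the Jython suggestion more concrete, here is a minimal sketch of what replaying browser headers could look like. It is only an illustration under assumptions: the helper names (headers_from_curl, build_request, fetch) are made up for this example, the header values are placeholders to be replaced with whatever your own browser sends, and there is no guarantee that Cloudflare will accept the request. The import fallback is there so the same code runs under Jython/Python 2.7 (urllib2) and Python 3.

```python
import re

try:
    import urllib2 as urlrequest          # Python 2.7 / Jython
except ImportError:
    import urllib.request as urlrequest   # Python 3

# Placeholder values (assumptions): copy the real ones from your browser's
# developer tools after solving the captcha there.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (placeholder; copy from your browser)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Cookie": "name=value",
}


def headers_from_curl(curl_command):
    """Collect the -H 'Name: value' pairs from a copied cURL command.

    Only quoted -H options are handled; cookies passed through other
    options (for example -b) are not picked up by this sketch.
    """
    headers = {}
    for match in re.finditer(r"-H\s+(['\"])(.*?)\1", curl_command):
        name, sep, val = match.group(2).partition(":")
        if sep:
            headers[name.strip()] = val.strip()
    return headers


def build_request(url, headers):
    """Build a request object carrying the given headers (no network I/O)."""
    return urlrequest.Request(url, headers=headers)


def fetch(url, headers, timeout=30):
    """Send the request and return the raw response body."""
    response = urlrequest.urlopen(build_request(url, headers), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

In an OpenRefine Jython expression, the cell value is available as value, so after the definitions above the expression could presumably end with something like return fetch(value, BROWSER_HEADERS). A 403 will still surface as an exception if Cloudflare rejects the request.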

Based on the information on this page, it looks like data that was previously available via a non-secured API is now restricted and requires a key.