Add column by fetching URLs results in a blank export

A_G · May 24, 2023, 4:38pm

Hello,
I’m a new user and have been struggling with OpenRefine for a while.
I need to get the text of PDF files using the “Add column by fetching URLs” option. I have scraped around 1,000 PDF links from Google Scholar.

I’m using the Apache Tika server (Java) to fetch these links.

This is the Python/Jython code:

    import urllib2

    url = 'http://localhost:9998/tika'
    req = urllib2.Request(url, '')
    req.add_header('Content-Type', 'application/pdf')
    req.add_header('fileUrl', value)
    req.get_method = lambda: 'PUT'
    post = urllib2.urlopen(req)
    return post.read()

The Tika server runs, and I can see that the import has worked because it fetches the XML version of the text, but when I click OK and look at the exported column, it’s blank.

I’ve tried updating the Java version, which was outdated, to no avail. I also tried wrapping the HTML links in quotes, which did not work either.

The following is a screenshot of the commands that run in my terminal.

Thanks for helping me out here,
Cheers
AG

ostephens · May 25, 2023, 2:28pm

Hi @A_G and welcome. It looks like you are mixing up two methods of fetching data from external URLs.

The “Add column by fetching URLs” option expects a URL as its input and then tries to GET the content of that URL (see documentation). By using it with the Python code you posted above, you are telling OpenRefine to treat the response you get back from Tika as a URL to be retrieved. That is not what you need.
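To illustrate why the column ends up blank, here is a rough sketch. It is not OpenRefine’s actual implementation; `is_fetchable_url` is a hypothetical helper and the sample strings are made up. It only shows that a PDF link is something that can be fetched, while the kind of XML text Tika returns is not.

```python
try:
    from urlparse import urlparse  # Python 2 / Jython
except ImportError:
    from urllib.parse import urlparse  # Python 3


def is_fetchable_url(text):
    """Hypothetical check: does the text look like an http(s) URL?"""
    parts = urlparse(text.strip())
    return parts.scheme in ('http', 'https') and bool(parts.netloc)


# Made-up examples: a PDF link, and XML text of the kind Tika sends back.
pdf_link = 'http://example.org/paper.pdf'
tika_output = '<?xml version="1.0"?><html><body>Some extracted text</body></html>'

link_ok = is_fetchable_url(pdf_link)        # a real URL, so a GET can be attempted
output_ok = is_fetchable_url(tika_output)   # not a URL, so there is nothing to GET
```

In other words, the expression in “Add column by fetching URLs” should produce a link, not the content you are ultimately after.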

Because you are already fetching the URL in the Python code you’ve written, you don’t want to use the “Add column by fetching URLs” option. Instead, use “Add column based on this column” and enter the Python code you’ve posted as the expression. That should store the returned data in the new column, as you are expecting.

Hope this helps. Please let me know if it doesn’t, or if you have further questions.
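As a minimal sketch of what that expression could look like, here is the posted code split into two small functions. The names `build_tika_request` and `fetch_text_via_tika` are my own, not part of OpenRefine or Tika, and the error handling is an assumption about what you might want in the cell when a fetch fails. The Tika URL and headers are taken unchanged from the original post.

```python
try:
    import urllib2  # Python 2 / Jython, as used inside OpenRefine
except ImportError:
    import urllib.request as urllib2  # Python 3, for trying this outside OpenRefine

TIKA_URL = 'http://localhost:9998/tika'  # endpoint from the original post


def build_tika_request(pdf_url, tika_url=TIKA_URL):
    """Build the PUT request that asks Tika to fetch and parse pdf_url."""
    req = urllib2.Request(tika_url, b'')  # empty body, as in the original code
    req.add_header('Content-Type', 'application/pdf')
    req.add_header('fileUrl', pdf_url)    # header used in the original post
    req.get_method = lambda: 'PUT'        # force PUT instead of the default POST
    return req


def fetch_text_via_tika(pdf_url, tika_url=TIKA_URL):
    """Send the request; return Tika's response body or an error message."""
    try:
        response = urllib2.urlopen(build_tika_request(pdf_url, tika_url))
        return response.read()
    except Exception as e:
        # Assumption: a visible error in the cell is more useful than a blank
        return 'Error: %s' % e
```

In the “Add column based on this column” dialog, with the language set to Python / Jython, the expression would end with `return fetch_text_via_tika(value)`, where `value` is the cell containing the PDF link.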

A_G · May 30, 2023, 6:32pm

Thank you so much! Yup, it helped!
