Add column by fetching URLs results in a blank export

Hello,
I'm a new user and have been struggling with OpenRefine for a while now.
I need to get the text of PDF files using the “Add column by fetching URLs” option. I have scraped around 1,000 PDF links from Google Scholar.

I'm using the Apache Tika server (Java) to fetch these links.

This is the Python/Jython code:

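import urllib2

# Local Tika server endpoint
url = 'http://localhost:9998/tika'

# Empty request body; the PDF location is passed in the fileUrl header
req = urllib2.Request(url, '')
req.add_header('Content-Type', 'application/pdf')
req.add_header('fileUrl', value)  # value is the PDF link in the current cell
req.get_method = lambda: 'PUT'    # force PUT instead of urllib2's default POST

post = urllib2.urlopen(req)
return post.read()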

The Tika server runs, and I can see that the import has worked, fetching an XML version of the text. But when I click OK and look at the resulting column, it is blank.

I've tried updating my Java version, which was outdated, to no avail. I also tried wrapping the links in quotes, which does not work either.

The following is a screenshot of the commands running in my terminal.

Thanks for helping me out here,
Cheers
AG

Hi @A_G and welcome. It looks like you are mixing up two methods of fetching data from external URLs.

The “Add column by fetching URLs” option expects a URL as input and then tries to GET the content of that URL (see documentation). By using it with the Python code you posted above, you are telling OpenRefine to treat the response you get back from Tika as a URL to be retrieved. That response is the extracted text, not a URL, so the fetch returns nothing and the column ends up blank.

Because your Python code already fetches the URL, you don’t want to use “Add column by fetching URLs”. Instead, use “Add column based on this column” with the Python code you’ve posted. That should store the returned data in the new column, as you are expecting.
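If it helps, here is a rough sketch of how that expression could be arranged for “Add column based on this column” (with the language set to Python / Jython). It sends the same request as your code; the function names `build_tika_request` and `fetch_text`, the blank-cell check, and the error string are my own additions for illustration, not anything OpenRefine or Tika requires. The import fallback is only there so the sketch can also be tried in a regular Python 3 interpreter outside OpenRefine.

```python
try:
    import urllib2 as request_lib         # Jython / Python 2, as used inside OpenRefine
except ImportError:
    import urllib.request as request_lib  # Python 3, for trying the sketch outside OpenRefine

TIKA_URL = 'http://localhost:9998/tika'   # Tika endpoint taken from the question


def build_tika_request(pdf_url):
    """Build the PUT request that points Tika at pdf_url (hypothetical helper)."""
    req = request_lib.Request(TIKA_URL, b'')           # empty request body
    req.add_header('Content-Type', 'application/pdf')
    req.add_header('fileUrl', pdf_url)                 # the PDF link to extract
    req.get_method = lambda: 'PUT'                     # override the default POST
    return req


def fetch_text(pdf_url):
    """Return the text Tika extracts, None for blank cells, or an error note."""
    if not pdf_url:
        return None                                    # skip empty cells
    try:
        return request_lib.urlopen(build_tika_request(pdf_url)).read()
    except Exception as e:
        # Assumption: keeping the failure reason in the cell is more useful than a blank
        return 'ERROR: ' + str(e)

# Inside OpenRefine, the expression would end with:
#     return fetch_text(value)
```

Storing the error message in the cell is a design choice: with around 1,000 links, some will probably fail, and a text facet on the new column then shows which ones and why.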

Hope this helps. Please let me know if it doesn’t, or if you have further questions.

Thank you so much! Yup, it helped!
