Hello,
I am a new user and have been struggling with OpenRefine for a while now.
I need to get the text of PDF files using the "Add column by fetching URLs" option. I have scraped around 1000 PDF links from Google Scholar.
I'm using the Java Apache Tika server to fetch these links.
This is the Python/Jython code:
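import urllib2
# Tika server endpoint that returns the extracted text
url = 'http://localhost:9998/tika'
# Empty request body; Tika is asked to fetch the PDF named in the fileUrl header
req = urllib2.Request(url, '')
req.add_header('Content-Type', 'application/pdf')
req.add_header('fileUrl', value)  # value is the PDF link in the current cell
req.get_method = lambda: 'PUT'  # override the default method so the request is sent as PUT
post = urllib2.urlopen(req)
return post.read()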
The Tika server runs, and I can see that the import works, fetching an XML version of the text, but when I hit OK and look at the exported column, it's blank.
I've tried updating the Java version, which was outdated, to no avail. I also tried wrapping the HTML links in quotes, which does not work either.
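In case it helps with reproducing this outside OpenRefine, here is a minimal sketch of the same request as standalone Python 3. It assumes the Tika server is listening on the same address as in the Jython snippet above and accepts the fileUrl header in the same way. The function names build_tika_request and fetch_text are hypothetical, made up for this illustration only.

```python
import urllib.request

# Same endpoint as in the Jython snippet (assumed to be running locally)
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_url, tika_url=TIKA_URL):
    """Build, but do not send, the PUT request that asks Tika to fetch pdf_url."""
    # Empty body: the PDF location is passed in the fileUrl header instead
    req = urllib.request.Request(tika_url, data=b"", method="PUT")
    req.add_header("Content-Type", "application/pdf")
    req.add_header("fileUrl", pdf_url)
    return req


def fetch_text(pdf_url, tika_url=TIKA_URL, timeout=60):
    """Send the request and return the response body decoded as text."""
    req = build_tika_request(pdf_url, tika_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys

    # Usage: python tika_check.py <pdf link>
    if len(sys.argv) > 1:
        # Print only the start of the response to keep the output short
        print(fetch_text(sys.argv[1])[:500])
```

Running this against a single PDF link should show whether Tika itself returns text for that link, which would separate a Tika problem from an OpenRefine one.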
The following is a screenshot of the commands that run in my terminal:
Thanks for helping me out here,
Cheers
AG