Hello,
I'm a new user and have been struggling with OpenRefine for a while.
I need to get the text of PDF files using the “Add column by fetching URLs” option. I have scraped around 1000 PDF links from Google Scholar.
I'm using an Apache Tika server (Java) to fetch these links.
Hi @A_G and welcome. It looks like you are mixing up two methods of fetching data from external URLs.
The “Add column by fetching URLs” option expects a URL as input and sends a GET request for the content at that URL (see documentation). By combining it with the Python code you posted above, you are telling OpenRefine to treat the response you get back from Tika as a URL to retrieve, which is not what you need.
Because the Python code you've written already fetches the URL, you don't want to use “Add column by fetching URLs”. Instead, use “Add column based on this column”, set the expression language to Python / Jython, and use the Python code you've posted as the expression. That should store the returned data in the new column, as you expect.
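For illustration only, here is a minimal sketch of what such an expression could look like. It is not your original code. It assumes the Tika server is running locally on its default port 9998, that the cell value is the PDF URL, and that Tika returns UTF-8 text. The names `TIKA_URL`, `build_tika_request` and `fetch_pdf_text` are made up for this example.

```python
try:
    import urllib2 as urlrequest  # Jython / Python 2.7, as used by OpenRefine
except ImportError:
    import urllib.request as urlrequest  # Python 3, for trying this outside OpenRefine

# Assumption: Tika server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"

def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    # Tika's /tika endpoint takes the document via PUT;
    # "Accept: text/plain" asks for plain text back.
    req = urlrequest.Request(tika_url, data=pdf_bytes,
                             headers={"Accept": "text/plain"})
    req.get_method = lambda: "PUT"  # works on both Python 2.7 and 3
    return req

def fetch_pdf_text(pdf_url, urlopen=urlrequest.urlopen):
    # Step 1: download the PDF itself.
    pdf_bytes = urlopen(pdf_url).read()
    # Step 2: send the PDF bytes to Tika and read the extracted text.
    text = urlopen(build_tika_request(pdf_bytes)).read()
    # Assumption: the response is UTF-8 encoded.
    return text.decode("utf-8")

# Inside OpenRefine, the expression would end with this line,
# where `value` is the cell containing the PDF URL:
# return fetch_pdf_text(value)
```

With around 1000 links, it may be worth testing on a small facet of rows first, since each row triggers two HTTP requests and a failed download will raise an error for that cell.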
Hope this helps. Please let me know if it doesn't, or if you have further questions.