Fetching URLs with Arabic values

Hello everyone.

I'm trying to fetch the romanization of Arabic multi-word strings from an online service.

If I manually paste the URL into a browser, the service responds correctly, with the words kept separate and consistent letter case.

In OpenRefine
"https://transliterate.qcri.org/ar2en/"+ value
gives an empty response (and no error)

while fetching
"https://transliterate.qcri.org/ar2en/"+value.escape('url')
the service responds with just a concatenation of the words, without spaces and with only the first letter uppercased.

What am I doing wrong? Am I missing some encoding?

A good way to debug URL fetching is to create a column containing the full URL rather than constructing it on the fly, so you can see exactly what's in there.

In this case there's a bug in the URL percent encoding which is generating plus signs (+) instead of percent encoding. Try this:

"https://transliterate.qcri.org/ar2en/"+value.escape('url').replace('+','%20')
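To see what this workaround produces outside OpenRefine, here is a minimal Python sketch. It assumes that escape('url') behaves like form-style encoding (spaces become plus signs), which is what the plus signs suggest; `quote_plus` stands in for it, and `build_url` is a made-up helper name.

```python
from urllib.parse import quote_plus

BASE = "https://transliterate.qcri.org/ar2en/"

def build_url(value: str) -> str:
    """Mimic value.escape('url').replace('+','%20') (assumed equivalent)."""
    # Form-style encoding: non-ASCII bytes become %XX, spaces become '+'.
    form_encoded = quote_plus(value)
    # A literal '+' in the input is already encoded as %2B at this point,
    # so every remaining '+' stands for a space and can safely become %20.
    return BASE + form_encoded.replace("+", "%20")

print(build_url("احمد احمد"))
```

Storing the result in its own column first, as suggested above, makes it easy to check that the URL contains %20 and no plus signs.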

Tom

I took a closer look at this and there are actually a few related problems:

In OpenRefine
"https://transliterate.qcri.org/ar2en/"+ value
gives an empty response (and no error)

This appears to be by design, but I'd argue it's a bug, so I've created issue #6137.
If you use all the defaults, you'll still get the old behavior because errors are ignored, but if you enable error storage, you'll be able to see an error like:



|

Illegal character in path at index 37: https://transliterate.qcri.org/ar2en/ احمد احمد

|



|

|

  • | - | - | - |
  1. The below is actually more a case of things being under-documented (#6138):

In this case there's a bug in the URL percent encoding which is generating plus signs (+) instead of percent encoding.

The encoding being done is correct for the query string portion of a URL (i.e. the part after the question mark (?)) and can be used in some other contexts, but, confusingly, URLs require three different types of percent encoding for different portions of the URL. The one that OpenRefine uses isn't legal for the path portion of the URL (which is where the Arabic phrase goes in your example). The other two encoding types could be added, but ...
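As a rough illustration of why the portion of the URL matters, the Python sketch below contrasts two of these styles using the standard library: `quote` for a path segment and `quote_plus` for form/query-style data. It is only an analogy for what OpenRefine does internally, and the variable names are illustrative.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

phrase = "احمد احمد"

# Path-style encoding: a space becomes %20.
path_encoded = quote(phrase, safe="")
# Form/query-style encoding: a space becomes '+'.
form_encoded = quote_plus(phrase)

# A path decoder treats '+' as a literal plus sign, not as a space,
# so form-encoded text placed in the path does not round-trip.
decoded_as_path = unquote(form_encoded)
# A form decoder turns '+' back into a space.
decoded_as_form = unquote_plus(form_encoded)

print(path_encoded)
print(form_encoded)
print(decoded_as_path == phrase)  # False: the space came back as '+'
print(decoded_as_form == phrase)  # True
```

A server that decodes the path this way sees a plus sign where the space was, so the two words are no longer separated by whitespace.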

  1. I'd argue that URLs should automatically go through the same encoding process that your browser uses, rather than requiring a separate step by the user, so I've added a feature request for that.

Tom