Code for checking running OpenRefine

I'm just trying to understand how the code that checks for a running OpenRefine instance works. In the refine script we have a check_running() function:

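check_running() {

    if [ "$CURL" ] ; then
        curl --noproxy 127.0.0.1 -s -S -f $URL > /dev/null 2>&1
        CURL_RETURN=$?
        # curl exit 7: failed to connect; 22: HTTP error returned (with -f)
        if [ $CURL_RETURN -eq "7" ] || [ $CURL_RETURN -eq "22" ]  ; then
            NOT_RUNNING="1"
        fi
    elif [ "$WGET" ] ; then
        no_proxy=127.0.0.1 wget -O - $URL > /dev/null 2>&1
        # wget exit 4: network failure
        if [ "$?" = "4" ] ; then
            NOT_RUNNING="1"
        fi
    fi

    if [ -z "${NOT_RUNNING}" ] ; then
        # Something answered on $URL; check that the page is OpenRefine's
        if [ "$CURL" ] ; then
            RUNNING=`curl --noproxy 127.0.0.1 -s $URL | grep "$CHECK_STR"`
        elif [ "$WGET" ] ; then
            RUNNING=`no_proxy=127.0.0.1 wget -O - $URL| grep "$CHECK_STR"`
        fi
        if [ -z "${RUNNING}" ] ; then
            error "OpenRefine isn't running on $URL. Maybe a proxy issue?"
        fi
    fi
}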

I have a couple of questions:

  1. Why is the --noproxy flag applied to 127.0.0.1 rather than ${REFINE_HOST_INTERNAL}?
  2. Why is the wget command no_proxy=127.0.0.1 wget -O - $URL > /dev/null 2>&1 rather than wget --no-proxy -O - $URL > /dev/null 2>&1?

I'm asking here before creating an issue because, while these things don't look right to me, I'm not confident enough to raise a bug report immediately, so I'm checking my understanding first!

127.0.0.1 is the IP loopback address (on systems that use /etc/hosts it is sometimes mapped to the "localhost" alias, but not always. Hi, old Sun systems!)

no_proxy is an ENV variable, here assigned the value 127.0.0.1.
When a tool or command expects an ENV variable to exist, you can often simply set the variable on the same line, immediately before the command. So no_proxy=127.0.0.1 means "don't use a proxy for the loopback address", and then we immediately execute wget ... with its arguments.

For the wget case, it seems someone knew that wget supports environment variables for proxies (see "Proxies" in the GNU Wget 1.24.5 Manual) and so assigned the ENV variable no_proxy prior to executing the wget command with its arguments.
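The one-line assignment described above can be checked without touching the network. This is a minimal sketch: a child sh stands in for wget, and the variable names seen_by_child and seen_afterwards are only for illustration.

```shell
#!/bin/sh
# Start from a clean state so the result does not depend on the caller's environment.
unset no_proxy

# The assignment prefix applies only to the single command that follows it.
# A child shell stands in for wget here, so no network access is needed.
seen_by_child=$(no_proxy=127.0.0.1 sh -c 'printf "%s" "$no_proxy"')

# Back in the calling shell the variable was never set.
seen_afterwards=${no_proxy:-unset}

printf 'child saw:  %s\n' "$seen_by_child"
printf 'afterwards: %s\n' "$seen_afterwards"
```

The child process sees the value, while the calling shell is left unchanged, which is why the script does not need an export followed by an unset.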

For the curl case, you need to be careful:

  • --noproxy takes a list of hosts that curl should reach directly, without going through a proxy. The --noproxy option overrides the ENV variables no_proxy and NO_PROXY.
  • --proxy is how the HTTP or HTTPS proxy is set. It also overrides any proxy ENV variables that are set.
  • Scroll down to the Environment section near the bottom of the curl man page to see all the ENV variables that curl understands (and that some of its arguments override).

Hi @ostephens,

  1. My understanding is that we want to avoid going through the system's HTTP proxy when we are making a call to any local web service. In that sense, --no-proxy is probably something that ought to be the default in most cases for curl and wget, but apparently isn't, so that's worth adding, I guess? If you are running OpenRefine to make it reachable from other machines, then the URL we'll have at that stage will be something that's resolvable from the outside, and there is no problem with using the system's proxy to check it, I guess? Of course the specifics of networking and hostname resolution really depend on how you're deploying OpenRefine (for instance in some Docker container), in which case what we're doing at the moment might not work. If you are aware of such a situation, perhaps it's worth describing it directly.

  2. Looking at wget's man page, this environment variable is indeed used to specify a list of hosts for which we shouldn't use a proxy. According to the same manual, the --no-proxy option of wget does not expect any argument, so it disables the HTTP proxy entirely for any host, I suspect. So if we wanted to use that, we'd rather do wget --no-proxy -O - $URL > /dev/null 2>&1

Also, as a side note, I'm really happy that you are so active again in the project :) (In case you are interested in attending the BarCamp happening this month, don't forget to register, even if you only plan to attend remotely.)

I guess I was wondering what happens in this case if the user has set ${REFINE_HOST_INTERNAL} to localhost. Would this still work, or would it then use the proxy (and is using the proxy the right or wrong thing to do in this scenario)?

Ah - I hadn't realised that was an option (I'm used to using export rather than setting the variable on the same line, but I can see that here you'd need to unset it afterwards, and the method used avoids this).

Thanks Antonin - trying to be a bit more active here although sometimes struggling to find the time (I'm still using OR daily and doing training here and there!)

Attending the barcamp in person is a dream but I've registered for remote attendance now - thanks for the nudge


If there's a system proxy, OpenRefine should use it automatically in my opinion.
A system proxy setting takes the highest precedence, so all downstream applications should respect "system" settings, and then "user" settings.
Usually, it's just about reading the ENV variable, which both curl and wget can do for us in the refine script if properly coded.

Sometimes the proxy is a hard thing to discover on some OSes. But luckily the world has embraced those ENV variables (Java included, which used to understand only the uppercase variables, but I think most HTTP client libraries now recognize both upper and lowercase proxy ENV variables).
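As a rough illustration of "just reading the ENV variable", here is a small sketch. The helper name effective_https_proxy is hypothetical, and preferring the lowercase name when both are set is an assumption made for this example, not a rule that every tool follows.

```shell
#!/bin/sh
# Hypothetical helper: report the proxy configured for HTTPS requests.
# Assumption for this sketch: the lowercase name wins when both are set.
effective_https_proxy() {
    printf '%s\n' "${https_proxy:-${HTTPS_PROXY:-}}"
}

proxy=$(effective_https_proxy)
if [ -n "$proxy" ]; then
    printf 'HTTPS requests would go through %s\n' "$proxy"
else
    printf 'No HTTPS proxy is configured in the environment\n'
fi
```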

If there's a system proxy, OpenRefine should use it automatically in my opinion.

Isn't the point of this code to not go through the system proxy when trying to access a local address?

My question is, if this is the point, why does the code only target 127.0.0.1 and not localhost? Is that a correct decision for some reason, or are we overlooking something?

@ostephens I was talking about external access (Fetch URLs, etc.) and that I'd like to see OpenRefine automatically use the system configured proxy when doing those operations.

For the check_running() function in the script: yes, the point of the code is that, when accessing the IP address 127.0.0.1 to see whether OpenRefine is running, curl or wget should be told NOT to go through the system proxy, because there's no need.

"localhost" is the hostname for the 127.0.0.1 address.
So, the code targets the IP address because it's an IP and not a hostname.
In Windows, the hostname was traditionally set in C:\Windows\System32\Drivers\etc\hosts
But now (for the last 15 years) localhost name resolution is handled within DNS itself (look at the comment by Microsoft in my /etc/hosts file below).
My /etc/hosts on Windows 11

Thanks @thadguidry. I understand that localhost usually points at 127.0.0.1 (although I think it can be anything in the 127.*.*.* range?), but I've always specified 127.0.0.1 and localhost separately in things like browser proxy settings, and I wondered if that was also needed here?

(My habit of specifying both is very long-standing. Quite possibly I was told this once many years ago, copied it, and never questioned it!)

I think this is done for 3.8? See "Handle proxy configuration, closes #5476 (#5477)", OpenRefine/OpenRefine@89e6dbe on GitHub.


You might be thinking instead of 0.0.0.0 and its relationship to "any adapters", or in other words INADDR_ANY in TCP/IP.
No, we don't need to account for any hostname; that is something a server "might" need to do if OpenRefine were being hosted somewhere other than locally.

So on my Mac:

http_proxy= curl -vI --noproxy 127.0.0.1 -s -S -f http://localhost:3333/
* Uses proxy env variable http_proxy == ''
* Host was resolved.
* IPv6: (none)
* IPv4:
*   Trying

My reading of the refine script is that this (with the exception of the -vI flags) is the sort of request that could be made. So here the --noproxy flag does not avoid the proxy, because we are calling localhost.
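To make the mismatch concrete, here is a simplified sketch of no-proxy matching. It is not curl's actual implementation; it only models the behaviour seen in the test above, where the list is compared against the host as written in the URL and nothing is resolved first. The function name host_in_noproxy is made up for this example, and it assumes a comma-separated list without spaces.

```shell
#!/bin/sh
# Simplified model: is HOST ($1) covered by the comma-separated LIST ($2)?
# The comparison is purely textual; no DNS or /etc/hosts lookup happens.
# The body runs in a subshell so the IFS and globbing changes stay local.
host_in_noproxy() (
    host=$1
    IFS=,
    set -f    # no filename expansion while splitting the list
    for entry in $2; do
        case "$host" in
            # Exact match, or HOST is a subdomain of ENTRY.
            "$entry" | *."$entry") exit 0 ;;
        esac
    done
    exit 1
)

for list in 127.0.0.1 127.0.0.1,localhost; do
    if host_in_noproxy localhost "$list"; then
        printf 'localhost with list %s: direct connection\n' "$list"
    else
        printf 'localhost with list %s: goes through the proxy\n' "$list"
    fi
done
```

Under this model a list containing only 127.0.0.1 never matches a URL that says localhost, even though both end up at the same address.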

Note earlier in the script we have

if [ "$REFINE_HOST" = '*' ] ; then
    echo No host specified while binding to interface, guessing localhost.

indicating that localhost is used as a fallback for the REFINE_HOST_INTERNAL variable, which is then used in check_running() in some scenarios.

If I specify localhost in my noproxy flag I see what I'd hope for:

http_proxy= curl -vI --noproxy localhost -s -S -f http://localhost:3333/
* Host localhost:3333 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:3333...
* connect to ::1 port 3333 from ::1 port 60638 failed: Connection refused
*   Trying 127.0.0.1:3333...
* Connected to localhost (127.0.0.1) port 3333
> HEAD / HTTP/1.1
> Host: localhost:3333
> User-Agent: curl/8.6.0
> Accept: */*
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 12 Jun 2024 14:51:01 GMT
Date: Wed, 12 Jun 2024 14:51:01 GMT
< Set-Cookie: host=.butterfly; Path=/
Set-Cookie: host=.butterfly; Path=/
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: text/html;charset=utf-8
Content-Type: text/html;charset=utf-8
< Transfer-Encoding: chunked
Transfer-Encoding: chunked

* Connection #0 to host localhost left intact

Based on the outcome of this test I'm reasonably certain we need to include localhost in the noproxy list even though it resolves to 127.0.0.1. Issue raised: "When checking for a running open refine localhost should be included in the no proxy list" (Issue #6673, OpenRefine/OpenRefine on GitHub).
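One possible shape for such a change, purely as a sketch: build the no-proxy list from both loopback names plus whatever host the script is checking. The helper name noproxy_hosts is hypothetical and this is not necessarily what was done for the issue.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated no-proxy list that always
# covers 127.0.0.1 and localhost, plus the host passed as $1 if it is new.
noproxy_hosts() {
    internal_host=$1
    list="127.0.0.1,localhost"
    if [ -n "$internal_host" ]; then
        case ",$list," in
            *",$internal_host,"*) ;;               # already in the list
            *) list="$list,$internal_host" ;;
        esac
    fi
    printf '%s\n' "$list"
}

# Example: what the list looks like when the internal host is localhost.
noproxy_hosts localhost

# The result could then be passed to either downloader, for instance:
#   curl --noproxy "$(noproxy_hosts "$REFINE_HOST_INTERNAL")" -s -S -f "$URL"
#   no_proxy="$(noproxy_hosts "$REFINE_HOST_INTERNAL")" wget -O - "$URL"
```

Keeping the list in one place means the curl and wget branches of check_running() cannot drift apart.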