Code for checking running OpenRefine

ostephens · June 5, 2024, 10:19am

I'm just trying to understand how the code that checks for a running OpenRefine instance works. In the refine script we have a check_running() function:

check_running() {
    check_downloaders
    URL="http://${REFINE_HOST_INTERNAL}:${REFINE_PORT}/"
    CHECK_STR="<title>OpenRefine</title>"

    if [ "$CURL" ] ; then
        curl --noproxy 127.0.0.1 -s -S -f $URL > /dev/null 2>&1
        CURL_RETURN=$?
        if [ $CURL_RETURN -eq "7" ] || [ $CURL_RETURN -eq "22" ]  ; then
            NOT_RUNNING="1"
        fi
    elif [ "$WGET" ] ; then
        no_proxy=127.0.0.1 wget -O - $URL > /dev/null 2>&1  
        if [ "$?" = "4" ] ; then
            NOT_RUNNING="1"
        fi
    fi    

    if [ -z "${NOT_RUNNING}" ] ; then
        if [ "$CURL" ] ; then
            RUNNING=`curl --noproxy 127.0.0.1 -s $URL | grep "$CHECK_STR"`
        elif [ "$WGET" ] ; then
            RUNNING=`no_proxy=127.0.0.1 wget -O - $URL| grep "$CHECK_STR"` 
        fi    
        
        if [ -z "${RUNNING}" ] ; then
            error "OpenRefine isn't running on $URL. Maybe a proxy issue?"
        fi
    else
        RUNNING=""
    fi
}

I have a couple of questions:

Why is the --noproxy flag applied to 127.0.0.1 rather than ${REFINE_HOST_INTERNAL}?
Why is the wget command no_proxy=127.0.0.1 wget -O - $URL > /dev/null 2>&1 rather than wget --no-proxy 127.0.0.1 -O - $URL > /dev/null 2>&1?

I'm asking here before creating an issue because while these things don't look right to me, I'm not confident enough to immediately raise a bug report - so asking here first to check my understanding!

thadguidry · June 5, 2024, 10:48am

127.0.0.1 is the IP loopback address (on systems that use /etc/hosts it is sometimes mapped to "localhost" alias - but not always - Hi old Sun Systems!)

no_proxy is an ENV variable, here assigned the value 127.0.0.1.
When a tool or command expects an ENV variable to exist, you can often simply set that variable on the same line, just before the command. So no_proxy=127.0.0.1 means "don't use a proxy for the loopback address", and then we immediately execute wget ... with its arguments.

For the wget case, it seems someone knew that wget supports environment variables for proxies (see "Proxies" in the GNU Wget 1.24.5 Manual) and so assigned the ENV variable no_proxy prior to executing the wget command with its arguments.
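
A minimal sketch of the prefix-assignment behaviour described above, assuming a POSIX shell. DEMO_VAR is a hypothetical variable name used only for illustration, and sh -c stands in for wget so that no network access is needed:

```shell
# DEMO_VAR is a made-up name for this illustration; it is not used by refine.
unset DEMO_VAR

# Prefix assignment: the variable is placed in the environment of this one
# command only (here a child shell that simply prints it).
inside=$(DEMO_VAR=127.0.0.1 sh -c 'echo "${DEMO_VAR:-unset}"')

# Once that command has returned, the calling shell is unchanged.
outside="${DEMO_VAR:-unset}"

echo "inside=$inside outside=$outside"
```

The child process sees the value while the calling shell does not, which is why the script has nothing to unset afterwards.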

For the curl case, you need to be careful...

--noproxy takes a list of hosts that curl should reach directly rather than through a proxy. The --noproxy option overrides the ENV variables no_proxy and NO_PROXY. https://curl.se/docs/manpage.html#--noproxy
--proxy is how the HTTP or HTTPS proxy is set. It also overrides any proxy ENV variables that are set. https://curl.se/docs/manpage.html#-x
Scroll down to the Environment section near the bottom of the curl man page and you'll see all the ENV variables that curl understands (and sometimes overrides!) with some of its arguments.

antonin_d · June 5, 2024, 10:48am

Hi @ostephens,

My understanding is that we want to avoid going through the system's HTTP proxy if we are making a call to any local web service. In that sense, bypassing the proxy for 127.0.0.1 is probably something that ought to be the default for both curl and wget, but apparently isn't, so it's worth adding, I guess? If you are running OpenRefine so that it is reachable from other machines, then the URL we have at that stage will be resolvable from the outside, and there is no problem with using the system's proxy to check it, I guess? Of course the specifics of networking and hostname resolution really depend on how you're deploying OpenRefine (for instance in a Docker container), in which case what we're doing at the moment might not work. If you are aware of such a situation, perhaps it's worth describing it directly.
Looking at wget's man page, the no_proxy environment variable is indeed used to specify a list of hosts for which we shouldn't use a proxy. According to the same manual, wget's --no-proxy option does not take an argument, so I suspect it disables the HTTP proxy entirely, for every host. So if we wanted to use that, we'd rather do wget --no-proxy -O - $URL > /dev/null 2>&1

Also, as a side note, I'm really happy that you are so active again in the project (In case you are interested to attend the BarCamp happening this month, don't forget to register, even if you only plan to attend remotely)

ostephens · June 6, 2024, 10:28am

I guess I was wondering what happens in this case if the user has set ${REFINE_HOST_INTERNAL} to localhost. Would this still work, or would it then use the proxy (and is using the proxy the right or wrong thing to do in this scenario)?

Ah - I hadn't realised that was an option (I'm used to using export rather than setting the variable on the same line as the command, but I can see that here you'd need to unset it afterwards, and the method used avoids this)

ostephens · June 6, 2024, 10:33am

Thanks Antonin - trying to be a bit more active here although sometimes struggling to find the time (I'm still using OR daily and doing training here and there!)

Attending the barcamp in person is a dream but I've registered for remote attendance now - thanks for the nudge

thadguidry · June 6, 2024, 1:05pm

If there's a system proxy, OpenRefine should use it automatically in my opinion.
System proxy settings sit at the top, so all downstream applications should respect "system" settings and "user" settings.
Usually, it's just about reading the ENV variable, which both curl and wget can do for us in the refine script if properly coded.

Sometimes the proxy is hard to discover on some OSes. But luckily the world has embraced those ENV variables (Java included, which used to understand only the uppercase variables; I think most HTTP client libraries now recognize both upper- and lowercase proxy ENV variables).

ostephens · June 12, 2024, 1:13pm

"If there's a system proxy, OpenRefine should use it automatically in my opinion."

Isn't the point of this code to not go through the system proxy when trying to access a local address?

My question is, if this is the point, why does the code only target 127.0.0.1 and not localhost? Is that a correct decision for some reason, or are we overlooking something?

thadguidry · June 12, 2024, 1:39pm

@ostephens I was talking about external access (Fetch URLs, etc.) and that I'd like to see OpenRefine automatically use the system configured proxy when doing those operations.

For the check_running() function in the script: yes, the point of the code is that when it accesses the 127.0.0.1 IP address to see if OpenRefine is running, it should tell curl or wget NOT to go through the system proxy, because there's no need.

"localhost" is the hostname for the 127.0.0.1 address.
So, the code targets the IP address 127.0.0.1 because it's an IP and not a hostname.
On Windows, the hostname has traditionally been set in C:\Windows\System32\Drivers\etc\hosts
But now (for the last 15 years) localhost name resolution is handled within DNS itself (look at the comment by Microsoft in my /etc/hosts file below).
[Screenshot: My /etc/hosts on Windows 11]

ostephens · June 12, 2024, 1:57pm

Thanks @thadguidry. I understand that localhost usually points at 127.0.0.1 (although I think it can be anything in the 127.*.*.* range?) but I've always specified 127.0.0.1 and localhost separately in things like browser proxy settings, and I wondered if that was also needed here?

(My habit of specifying both is very long-standing. Quite possibly I was told this once many years ago, copied what I was told, and never questioned it!)

ostephens · June 12, 2024, 2:02pm

I think this is done for 3.8? See the commit "Handle proxy configuration, closes #5476 (#5477)" (OpenRefine/OpenRefine@89e6dbe on GitHub).

thadguidry · June 12, 2024, 2:09pm

You might instead be thinking of 0.0.0.0 and its relationship to "any adapters", in other words INADDR_ANY in TCP/IP.
No, we don't need to account for any hostname. That is something a server "might" need to do, however, if OpenRefine were being hosted somewhere other than locally.

ostephens · June 12, 2024, 2:48pm

So on my Mac:

http_proxy=http://example.org curl -vI --noproxy 127.0.0.1 -s -S -f http://localhost:3333/
* Uses proxy env variable http_proxy == 'http://example.org'
* Host example.org:1080 was resolved.
* IPv6: (none)
* IPv4: 93.184.215.14
*   Trying 93.184.215.14:1080...

From my reading of the refine script, this (with the exception of the -vI flags) is the sort of request that could be made. So here the --noproxy 127.0.0.1 flag does not avoid the proxy, because we are calling localhost.
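
The behaviour seen above is consistent with curl comparing the --noproxy entries against the host name as written in the URL, rather than against the address it resolves to. A simplified sketch of that kind of name-based matching follows. host_in_noproxy is a hypothetical helper that only does exact comparisons against a comma-separated list; it is not curl's actual matching logic, which handles more cases (such as domain suffixes and the * wildcard):

```shell
# Hypothetical helper, for illustration only (not part of curl or refine).
# Succeeds if host ($1) exactly equals one entry of the comma-separated
# list ($2). The comparison is on the name as a string; nothing is resolved.
host_in_noproxy() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

if host_in_noproxy localhost "127.0.0.1"; then
    echo "localhost bypasses the proxy"
else
    echo "localhost goes through the proxy"
fi
```

Under this model, localhost only bypasses the proxy when the string localhost itself is in the list, even though it resolves to 127.0.0.1.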

Note earlier in the script we have

if [ "$REFINE_HOST" = '*' ] ; then
    echo No host specified while binding to interface 0.0.0.0, guessing localhost.
    REFINE_HOST_INTERNAL="localhost"
else
    REFINE_HOST_INTERNAL="$REFINE_HOST"
fi

This shows that, in some scenarios, localhost is the fallback value for the REFINE_HOST_INTERNAL variable that check_running() then uses.

If I specify localhost in my noproxy flag I see what I'd hope for:

http_proxy=http://example.org curl -vI --noproxy localhost -s -S -f http://localhost:3333/
* Host localhost:3333 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:3333...
* connect to ::1 port 3333 from ::1 port 60638 failed: Connection refused
*   Trying 127.0.0.1:3333...
* Connected to localhost (127.0.0.1) port 3333
> HEAD / HTTP/1.1
> Host: localhost:3333
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 12 Jun 2024 14:51:01 GMT
Date: Wed, 12 Jun 2024 14:51:01 GMT
< Set-Cookie: host=.butterfly; Path=/
Set-Cookie: host=.butterfly; Path=/
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: text/html;charset=utf-8
Content-Type: text/html;charset=utf-8
< Transfer-Encoding: chunked
Transfer-Encoding: chunked

<
* Connection #0 to host localhost left intact

ostephens · June 13, 2024, 8:28am

Based on the outcome of this test I'm reasonably certain we need to include localhost in the noproxy list even though it resolves to 127.0.0.1. Issue raised: "When checking for a running open refine localhost should be included in the no proxy list" (Issue #6673, OpenRefine/OpenRefine on GitHub).
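
One possible shape for such a change, as a sketch only and not necessarily what the issue ends up adopting: list both 127.0.0.1 and localhost for curl and wget. NO_PROXY_HOSTS is a hypothetical variable name. REFINE_HOST_INTERNAL and REFINE_PORT are the variables from the refine script, given illustrative values here. The commands are only printed, not run:

```shell
# Illustrative values; in the refine script these are set elsewhere.
REFINE_HOST_INTERNAL="localhost"
REFINE_PORT="3333"
URL="http://${REFINE_HOST_INTERNAL}:${REFINE_PORT}/"

# Hypothetical variable: both spellings of the loopback host, comma-separated,
# which is the list format accepted by curl's --noproxy and wget's no_proxy.
NO_PROXY_HOSTS="127.0.0.1,localhost"

# Build the check commands as strings so the sketch has no side effects.
curl_check_cmd="curl --noproxy $NO_PROXY_HOSTS -s -S -f $URL"
wget_check_cmd="no_proxy=$NO_PROXY_HOSTS wget -O - $URL"

echo "$curl_check_cmd"
echo "$wget_check_cmd"
```

This leaves the proxy in use for any other value of REFINE_HOST_INTERNAL, in line with the earlier point that an externally resolvable host can be checked through the system proxy.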
