I'm just trying to understand how the code that checks for a running OpenRefine instance works. In the refine script we have a check_running() function:
check_running() {
    check_downloaders
    URL="http://${REFINE_HOST_INTERNAL}:${REFINE_PORT}/"
    CHECK_STR="<title>OpenRefine</title>"

    if [ "$CURL" ] ; then
        curl --noproxy 127.0.0.1 -s -S -f $URL > /dev/null 2>&1
        CURL_RETURN=$?
        if [ $CURL_RETURN -eq "7" ] || [ $CURL_RETURN -eq "22" ] ; then
            NOT_RUNNING="1"
        fi
    elif [ "$WGET" ] ; then
        no_proxy=127.0.0.1 wget -O - $URL > /dev/null 2>&1
        if [ "$?" = "4" ] ; then
            NOT_RUNNING="1"
        fi
    fi

    if [ -z "${NOT_RUNNING}" ] ; then
        if [ "$CURL" ] ; then
            RUNNING=`curl --noproxy 127.0.0.1 -s $URL | grep "$CHECK_STR"`
        elif [ "$WGET" ] ; then
            RUNNING=`no_proxy=127.0.0.1 wget -O - $URL| grep "$CHECK_STR"`
        fi

        if [ -z "${RUNNING}" ] ; then
            error "OpenRefine isn't running on $URL. Maybe a proxy issue?"
        fi
    else
        RUNNING=""
    fi
}
I have a couple of questions:
Why is curl's --noproxy flag applied to 127.0.0.1 rather than ${REFINE_HOST_INTERNAL}?
Why is the wget command no_proxy=127.0.0.1 wget -O - $URL > /dev/null 2>&1 rather than wget --no-proxy 127.0.0.1 -O - $URL > /dev/null 2>&1?
I'm asking here before creating an issue because while these things don't look right to me, I'm not confident enough to immediately raise a bug report - so asking here first to check my understanding!
127.0.0.1 is the IP loopback address (on systems that use /etc/hosts it is sometimes mapped to the "localhost" alias, but not always. Hi, old Sun systems!)
no_proxy is an ENV variable, here assigned the value 127.0.0.1.
When a tool or command expects an ENV variable to exist, you can often simply set it on the same line, just before the command. So no_proxy=127.0.0.1 means "don't use a proxy for the loopback address", and then we immediately execute wget ... with its arguments.
For the wget case, it seems someone knew that wget supports environment variables for proxies (see "Proxies" in the GNU Wget 1.24.5 Manual) and so assigned the ENV variable no_proxy just before executing the wget command with its arguments.
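To make the scoping concrete, here is a minimal sketch (not taken from the refine script) that stands in a plain `sh -c` child for wget. It assumes a POSIX shell; the names `child_value` and `parent_value` are just illustrative.

```shell
#!/bin/sh
# Start from a clean slate so the result does not depend on the caller's environment.
unset no_proxy

# The assignment prefix applies to this one child process only.
child_value=$(no_proxy=127.0.0.1 sh -c 'printf "%s" "${no_proxy:-unset}"')

# Back in the parent shell, the variable was never set.
parent_value=${no_proxy:-unset}

echo "child:  $child_value"
echo "parent: $parent_value"
```

The child sees 127.0.0.1 while the parent still reports the variable as unset, which is why the one-line form needs no export and no unset afterwards.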
For the curl case, you need to be careful, because --noproxy takes a list of hosts that curl should reach directly, without going through a proxy, and the --noproxy option overrides the no_proxy and NO_PROXY ENV variables. https://curl.se/docs/manpage.html#--noproxy
Scroll down to the Environment section near the bottom of the curl man page and you'll see all the ENV variables that curl understands (and sometimes overrides!) with some of its arguments.
My understanding is that we want to avoid going through the system's HTTP proxy if we are making a call to any local web service. In that sense, bypassing the proxy for 127.0.0.1 is probably something that ought to be the default in both curl and wget, but apparently isn't, so that's worth adding, I guess? If you are running OpenRefine to make it reachable from other machines, then the URL we'll have at that stage will be something that's resolvable from the outside and there is no problem with using the system's proxy to check it, I guess? Of course the specifics of networking and hostname resolution really depend on how you're deploying OpenRefine (for instance in some Docker container), in which case what we're doing at the moment might not work. If you are aware of such a situation, perhaps it's worth describing it directly.
Looking at wget's man page, this environment variable is indeed used to specify a list of hosts for which we shouldn't use a proxy. According to the same manual, the --no-proxy option of wget does not expect any argument, so it disables the HTTP proxy entirely for any host, I suspect. So if we wanted to use that, we'd rather do wget --no-proxy -O - $URL > /dev/null 2>&1
Also, as a side note, I'm really happy that you are so active again in the project (In case you are interested to attend the BarCamp happening this month, don't forget to register, even if you only plan to attend remotely)
I guess I was wondering what happens in this case if the user has set ${REFINE_HOST_INTERNAL} to localhost. Would this still work, or would it then use the proxy (and is using the proxy the right or wrong thing to do in this scenario)?
Ah - I hadn't realised that was an option (I'm used to using export rather than setting the variable on the same line as the command, but I can see that here you'd need to unset it afterwards, and the method used avoids this)
Thanks Antonin - trying to be a bit more active here although sometimes struggling to find the time (I'm still using OR daily and doing training here and there!)
Attending the barcamp in person is a dream but I've registered for remote attendance now - thanks for the nudge
If there's a system proxy, OpenRefine should use it automatically in my opinion.
A system proxy setting takes the highest precedence, so all downstream applications should respect "system" settings, and "user" settings.
Usually, it's just about reading the ENV variable, which both curl and wget can do for us in the refine script if properly coded.
Sometimes the proxy is a hard thing to discover on some OSes. But luckily the world has embraced those ENV variables (Java included, which used to understand only the uppercase variables, but I think most HTTP client libraries recognize both upper and lowercase proxy ENV variables now. "I think")
If there's a system proxy, OpenRefine should use it automatically in my opinion.
Isn't the point of this code to not go through the system proxy when trying to access a local address?
My question is, if this is the point, why does the code only target 127.0.0.1 and not localhost? Is that a correct decision for some reason, or are we overlooking something?
@ostephens I was talking about external access (Fetch URLs, etc.) and that I'd like to see OpenRefine automatically use the system configured proxy when doing those operations.
For the check_running() function in the script: yes, the point of the code is that when trying to access the 127.0.0.1 IP address to see if OpenRefine is running, the code should tell curl or wget NOT to go through the system proxy, because there's no need.
"localhost" is the hostname for the 127.0.0.1 address.
So, the code targets the IP address 127.0.0.1 because it's an IP and not a hostname.
On Windows, the hostname was traditionally set in C:\Windows\System32\Drivers\etc\hosts
But now (for roughly the last 15 years) localhost name resolution is handled within DNS itself (look at the comment by Microsoft in my /etc/hosts file below).
My /etc/hosts on Windows 11
Thanks @thadguidry. I understand that localhost usually points at 127.0.0.1 (although I think it can be anything in the 127.*.*.* range?) but I've always specified 127.0.0.1 and localhost separately in things like browser proxy settings, and I wondered if that was also needed here?
(My habit of specifying both is very long standing, and quite possibly I was just told this once many years ago, copied it, and never questioned it!)
You might be thinking instead of 0.0.0.0 and its relationship to "any adapters" or in other words INADDR_ANY in TCP/IP.
No, we don't need to account for any hostname. That is something a server "might" need to do, however, if OpenRefine were being hosted somewhere other than locally.
My reading of the refine script is that this (with the exception of the -vI flags) is the sort of request that could be made. So here the --noproxy 127.0.0.1 flag does not avoid the proxy, because we are calling localhost rather than 127.0.0.1.
Note earlier in the script we have
if [ "$REFINE_HOST" = '*' ] ; then
    echo No host specified while binding to interface 0.0.0.0, guessing localhost.
    REFINE_HOST_INTERNAL="localhost"
else
    REFINE_HOST_INTERNAL="$REFINE_HOST"
fi
indicating that, in some scenarios, localhost is used as the fallback value of REFINE_HOST_INTERNAL, which check_running() then uses to build the URL.
If I specify localhost in my noproxy flag I see what I'd hope for:
http_proxy=http://example.org curl -vI --noproxy localhost -s -S -f http://localhost:3333/
* Host localhost:3333 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
* Trying [::1]:3333...
* connect to ::1 port 3333 from ::1 port 60638 failed: Connection refused
* Trying 127.0.0.1:3333...
* Connected to localhost (127.0.0.1) port 3333
> HEAD / HTTP/1.1
> Host: localhost:3333
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 12 Jun 2024 14:51:01 GMT
Date: Wed, 12 Jun 2024 14:51:01 GMT
< Set-Cookie: host=.butterfly; Path=/
Set-Cookie: host=.butterfly; Path=/
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Content-Type: text/html;charset=utf-8
Content-Type: text/html;charset=utf-8
< Transfer-Encoding: chunked
Transfer-Encoding: chunked
<
* Connection #0 to host localhost left intact
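One possible way to cover both spellings, sketched here as an assumption rather than as what the refine script does or should do: build a comma-separated bypass list that always contains 127.0.0.1 and localhost, plus whatever REFINE_HOST_INTERNAL is set to. The helper name build_noproxy_list is hypothetical. Both curl's --noproxy option and wget's no_proxy variable accept a comma-separated list.

```shell
#!/bin/sh
# Hypothetical helper, not part of the refine script: print a comma-separated
# proxy-bypass list covering the loopback IP, the localhost name, and the
# given host (if it is not already in the list).
build_noproxy_list() {
    host="$1"
    list="127.0.0.1,localhost"
    if [ -n "$host" ]; then
        case ",$list," in
            *",$host,"*) ;;                  # already covered, nothing to add
            *) list="$list,$host" ;;         # append the extra host
        esac
    fi
    printf '%s\n' "$list"
}

# Possible use inside check_running() (assumed, untested against the real script):
#   curl --noproxy "$(build_noproxy_list "$REFINE_HOST_INTERNAL")" -s -S -f "$URL"
#   no_proxy="$(build_noproxy_list "$REFINE_HOST_INTERNAL")" wget -O - "$URL"

build_noproxy_list "localhost"
build_noproxy_list "refine.example.org"
```

With this approach the localhost fallback shown earlier would be covered without changing behaviour for setups where REFINE_HOST is an externally resolvable name. Whether such a host should bypass the proxy at all is the open question from this thread, so that part is a design choice rather than a given.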