New subject: Better way to identify client IP address

22 Jan 2015

      Thanks for the detailed response.
...
That looks like (if X-Forwarded-For is “valid”) your jumping straight
to considering the X-Forwarded-For header. You might want to check
beforehand whether the client ip is a “trusted” proxy. And only if it
is, you should walk the X-Forwarded-For.
...
Also, it sounds a bit like you're going through X-Forwarded-For from
left to right. Make sure to walk it from right to left, as proxies are
expected to append (not prepend) the client IP they forward for.
Okay. Assuming we append client_ip to X-Forwarded-For and process the list
from right-to-left, do you still think we should explicitly check whether
client_ip is a trusted proxy? Because, per the modified algorithm, the
client_ip will be the first item to be processed and if it is not a trusted
proxy, it will be returned as the result - which is what even the explicit
initial check will result in this case as well. Correct?
...
The way you combine the looping and if looks like you might be
skipping across entries that you cannot parse.
But if you find invalid entries while backtracking the IPs, you're
probably in parts of the X-Forwarded-For value that you shouldn't trust.
...
(X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not just
a random string. If we find invalid entries, then the only option we have
is to continue the search further for a valid IP.
...
list_of_ips = append client_ip to X-Forwarded-For
  for ip in reverse( list_of_ips )
  do
      if ip is not a trusted proxy or the iterator does not have more
elements
...
  then
      return ip
  fi

done
...
(That should give invalid IP addresses for some requests. That
seems to be the correct thing from my point of view. But if you rather
geolocate wrong than not geolocate at all, you can choose to pick the
last known good IP address instead. I guess that's more a matter of
taste.)
What would picking the last known good IP address in this context? An IP
that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
...
Different stakeholders at the WMF use different lists of what kind of
proxies they consider. So things are not really clear cut.
Compiling a common list - may be union of "General", "MediaWiki" and
"Wikipedia Zero" - would be very useful. As Oliver suggests, it would be
good to have them in site-wide config files.
Thanks,
Ananth

Re: [Analytics] Better way to identify client IP address