Hi,
[ I discussed this issue with Ironholds some weeks back. Also he reached out to Ops about this some more weeks back. So I guess he can chime in with more details. Only chiming it with my half-knowledge, as kevinator asked me to ]
Warning: Longish-email :-/
On Tue, Jan 20, 2015 at 10:33:01PM +0530, Ananth RK wrote:
IF X-Forwarded-For value is not valid: return remote_host ELSE: FOR EACH valid IP address, proxy_ip, in the comma-separated X-Forwarded-For value:
That looks like (if X-Forwarded-For is “valid”) your jumping straight to considering the X-Forwarded-For header. You might want to check beforehand whether the client ip is a “trusted” proxy. And only if it is, you should walk the X-Forwarded-For.
Also, it sounds a bit like you're going through X-Forwarded-For from left to right. Make sure to walk it from right to left, as proxies are expected to append (not prepend) the client IP they forward for.
IF proxy_ip does not start with "127.0" or "192.168" or "10."
or "169.254": return proxy_ip
The way you combine the looping and if looks like you might be skipping across entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't trust.
(X-Forwarded-For can easily be spoofed by clients)
What are the ways in which this naive algorithm can be improved?
Handwavy pseudo-code only to illustrate the general pattern:
list_of_ips = append client_ip to X-Forwarded-For for ip in reverse( list_of_ips ) do if ip is not a trusted proxy or the iterator does not have more elements then return ip fi done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you rather geolocate wrong than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
For example, is it better to maintain a separate list of IP address to ignore (currently only 4)? If yes, how do we ensure that the list is exhaustive? Any other improvements?
Different stakeholders at the WMF use different lists of what kind of proxies they consider. So things are not really clear cut.
(Do not treat the below list as authorative or complete. It's merely a collection of the pieces I know. And I did not check since quite some time. So it might be horribly outdated.)
* General
In general, I'd say one can assume that WMF-servers do not append bogus data to X-Forwarded-For. So when backtracking IPs, you can consider the hosts in $all_networks from https://git.wikimedia.org/blob/operations%2Fpuppet.git/9f97e3c2c5bc012ba5c37... as proxy, if they claim to have forwarded the request.
One might handle Labs special in there. Not sure.
* MediaWiki
MediaWiki considers $wgSquidServersNoPurge https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/af1cc64bb55... and https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FTrustedXFF/ec180fbd7... as proxies.
* Wikipedia Zero
Wikipedia Zero does not follow the MediaWiki pattern when tagging, but is stripping WMF servers and considers sets of IP addresses from
https://zero.wikimedia.org/wiki/Zero:-NOKIAPROD https://zero.wikimedia.org/wiki/Zero:-NOKIAQA https://zero.wikimedia.org/wiki/Zero:-OPERA
(that's a private wiki. :-/ ) as proxies.
* Legacy analytics software
Legacy analytics software often used hardcoded lists of IP addresses they considered proxies. Those lists became stale quickly. I'd just ignore legacy software and not model what they modeled.
------------------------------------------------------------
If you want for a simple solution that at least allows determining the IP that made the request to the WMF servers, and you do not need to model a behaviour of some given component, go for “General”.
If you want to mimic MediaWiki as closely as possible, go for the “MediaWiki” item.
If you care more about Wikipedia Zero, and carrier/geotagging there, go for the “Wikipedia Zero” item.
If you are looking for a general purpose, versatile approach, add a parameter that allows to select between the different strategies to resolve X-Forwarded-For.
Yes, that sucks :-/
Have fun, Christian
P.S.: Note that the repositories that the above links point to are not fully static. While their proxy configuration does not change every other day, they do change every now and then.