We currently have a method in Geo UDF that takes an IP address from the remote host header and the X-Forwarded-For value and attempts to identify the originating client IP address by following a simple algorithm as follows:
GetClientIP(remote_host, X-Forwarded-For)
    IF X-Forwarded-For value is not valid:
        return remote_host
    ELSE:
        FOR EACH valid IP address, proxy_ip, in the comma-separated X-Forwarded-For value:
            IF proxy_ip does not start with "127.0" or "192.168" or "10." or "169.254":
                return proxy_ip
What are the ways in which this naive algorithm can be improved? For example, is it better to maintain a separate list of IP addresses to ignore (currently only 4)? If yes, how do we ensure that the list is exhaustive? Any other improvements?
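For reference, the algorithm described above might look roughly like this in Python. This is only a sketch: the names `get_client_ip_naive` and `is_valid_ip` are made up for illustration, "not valid" is assumed to mean "contains no parsable IP address", and the final fallback to `remote_host` is an assumption, since the pseudocode does not say what happens when every entry is skipped.

```python
import ipaddress

# The four prefixes from the pseudocode, matched as plain strings.
IGNORED_PREFIXES = ("127.0", "192.168", "10.", "169.254")


def is_valid_ip(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text.strip())
        return True
    except ValueError:
        return False


def get_client_ip_naive(remote_host, x_forwarded_for):
    """Left-to-right scan that returns the first entry not matching a prefix."""
    entries = [e.strip() for e in (x_forwarded_for or "").split(",")]
    valid = [e for e in entries if is_valid_ip(e)]
    if not valid:
        # Assumption: "not valid" means no parsable address at all.
        return remote_host
    for proxy_ip in valid:
        if not proxy_ip.startswith(IGNORED_PREFIXES):
            return proxy_ip
    # Assumption: the pseudocode leaves this case open.
    return remote_host
```

With an X-Forwarded-For value of "172.16.0.5, 203.0.113.9" this sketch returns 172.16.0.5, because 172.16.0.0/12 is a private range that none of the four prefixes covers.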
Thanks, Ananth
Hi,
[ I discussed this issue with Ironholds some weeks back. He also reached out to Ops about this some more weeks back. So I guess he can chime in with more details. Only chiming in with my half-knowledge, as kevinator asked me to ]
Warning: Longish-email :-/
On Tue, Jan 20, 2015 at 10:33:01PM +0530, Ananth RK wrote:
> IF X-Forwarded-For value is not valid:
>     return remote_host
> ELSE:
>     FOR EACH valid IP address, proxy_ip, in the comma-separated X-Forwarded-For value:
That looks like (if X-Forwarded-For is “valid”) you're jumping straight to considering the X-Forwarded-For header. You might want to check beforehand whether the client IP is a “trusted” proxy. Only if it is should you walk the X-Forwarded-For.
Also, it sounds a bit like you're going through X-Forwarded-For from left to right. Make sure to walk it from right to left, as proxies are expected to append (not prepend) the client IP they forward for.
> IF proxy_ip does not start with "127.0" or "192.168" or "10." or "169.254":
>     return proxy_ip
The way you combine the loop and the IF looks like you might be skipping over entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't trust.
(X-Forwarded-For can easily be spoofed by clients)
> What are the ways in which this naive algorithm can be improved?
Handwavy pseudo-code only to illustrate the general pattern:
list_of_ips = append client_ip to X-Forwarded-For
for ip in reverse( list_of_ips ) do
    if ip is not a trusted proxy or the iterator does not have more elements then
        return ip
    fi
done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you'd rather geolocate wrongly than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
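A minimal Python sketch of the right-to-left walk outlined above, for illustration only. The function name `resolve_client_ip`, the `trusted_networks` parameter and the `strict` flag are hypothetical; `strict=True` gives up (returns `None`) on an unparsable entry, while `strict=False` falls back to the last known good address, matching the two options mentioned in the parenthetical.

```python
import ipaddress


def resolve_client_ip(client_ip, x_forwarded_for, trusted_networks, strict=True):
    """Walk client_ip and the X-Forwarded-For entries from right to left.

    Returns the first address that is not a trusted proxy, or the last
    address in the chain if every hop is trusted.
    """
    trusted = [ipaddress.ip_network(n) for n in trusted_networks]
    entries = [e.strip() for e in (x_forwarded_for or "").split(",") if e.strip()]
    # The connecting host is the most recent hop, so it goes last.
    chain = entries + [client_ip]

    last_good = None
    for raw in reversed(chain):
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            # Unparsable entry: everything from here leftwards is untrusted.
            return None if strict else last_good
        last_good = str(ip)
        if not any(ip in net for net in trusted):
            return last_good
    # Iterator exhausted: every hop was a trusted proxy.
    return last_good
```

For example, with `trusted_networks=["198.51.100.0/24"]` (a documentation range standing in for real proxy ranges), a request from 198.51.100.10 carrying "1.2.3.4, 203.0.113.5" resolves to 203.0.113.5: the walk stops at the first untrusted hop, so the client-supplied 1.2.3.4 is never considered.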
> For example, is it better to maintain a separate list of IP addresses to ignore (currently only 4)? If yes, how do we ensure that the list is exhaustive? Any other improvements?
Different stakeholders at the WMF use different lists of what kind of proxies they consider. So things are not really clear cut.
(Do not treat the list below as authoritative or complete. It's merely a collection of the pieces I know, and I have not checked in quite some time. So it might be horribly outdated.)
* General
In general, I'd say one can assume that WMF servers do not append bogus data to X-Forwarded-For. So when backtracking IPs, you can consider the hosts in $all_networks from https://git.wikimedia.org/blob/operations%2Fpuppet.git/9f97e3c2c5bc012ba5c37... as proxies, if they claim to have forwarded the request.
One might handle Labs specially in there. Not sure.
* MediaWiki
MediaWiki considers $wgSquidServersNoPurge https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/af1cc64bb55... and https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FTrustedXFF/ec180fbd7... as proxies.
* Wikipedia Zero
Wikipedia Zero does not follow the MediaWiki pattern when tagging. Instead, it strips WMF servers and considers the sets of IP addresses from
https://zero.wikimedia.org/wiki/Zero:-NOKIAPROD
https://zero.wikimedia.org/wiki/Zero:-NOKIAQA
https://zero.wikimedia.org/wiki/Zero:-OPERA
(that's a private wiki. :-/ ) as proxies.
* Legacy analytics software
Legacy analytics software often used hardcoded lists of IP addresses they considered proxies. Those lists became stale quickly. I'd just ignore legacy software and not model what they modeled.
------------------------------------------------------------
If you want a simple solution that at least allows determining the IP that made the request to the WMF servers, and you do not need to model the behaviour of some given component, go for “General”.
If you want to mimic MediaWiki as closely as possible, go for the “MediaWiki” item.
If you care more about Wikipedia Zero, and carrier/geotagging there, go for the “Wikipedia Zero” item.
If you are looking for a general-purpose, versatile approach, add a parameter that selects between the different strategies for resolving X-Forwarded-For.
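Such a strategy parameter could be sketched in Python as follows. Everything here is hypothetical: the `XffStrategy` enum, the `TRUSTED_NETWORKS` table and the `is_trusted_proxy` helper are illustrative names, and the ranges are RFC 5737 documentation blocks used as placeholders, not the real contents of $all_networks, $wgSquidServersNoPurge, TrustedXFF or the Zero configuration pages.

```python
import ipaddress
from enum import Enum


class XffStrategy(Enum):
    GENERAL = "general"
    MEDIAWIKI = "mediawiki"
    WIKIPEDIA_ZERO = "wikipedia_zero"


# Placeholder ranges only; the real lists live in the sources linked above
# and change over time, so they would need to be loaded, not hardcoded.
TRUSTED_NETWORKS = {
    XffStrategy.GENERAL: ["198.51.100.0/24"],
    XffStrategy.MEDIAWIKI: ["198.51.100.0/24", "203.0.113.0/28"],
    XffStrategy.WIKIPEDIA_ZERO: ["198.51.100.0/24", "192.0.2.0/25"],
}


def is_trusted_proxy(ip_text, strategy):
    """Return True if ip_text falls in a range trusted under the strategy."""
    try:
        ip = ipaddress.ip_address(ip_text.strip())
    except ValueError:
        # Unparsable entries are never trusted.
        return False
    networks = (ipaddress.ip_network(n) for n in TRUSTED_NETWORKS[strategy])
    return any(ip in net for net in networks)
```

The same address can then be a proxy under one strategy and a client under another, which is the point of making the choice explicit.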
Yes, that sucks :-/
Have fun, Christian
P.S.: Note that the repositories that the above links point to are not fully static. While their proxy configuration does not change every other day, they do change every now and then.
We really ought to move proxy IP ranges to Meta or some other wiki. Suggestions?
On Tue, Jan 20, 2015 at 4:44 PM, Christian Aistleitner < christian@quelltextlich.at> wrote:
-- 
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3, 4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I'd actually suggest just moving them to the site-wide config files we use for all other recognised proxies.
On 20 January 2015 at 20:02, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
> We really ought to move proxy ip ranges to meta or some other wiki. Suggestions?
Oliver Keyes, 21/01/2015 02:17:
> I'd actually suggest just moving them to the site-wide config files we use for all other recognised proxies.
If they aren't specific to Wikimedia projects, there is no reason for them to be in WMF configuration. On the contrary, there are ample precedents for such things being managed on Meta-Wiki (spam blacklist, titleblacklist, interwiki map etc.), even for extensions bundled with core. On the other hand, it would clearly be better to outsource the maintenance of such lists to a generic provider, as we did with TorBlock.
Nemo
I don't grok that message, I'm afraid. There's a config list of "these IPs are IPs we accept as trusted XFF providers"; it's used in CheckUser and similar tools to avoid ending up with everyone geolocating to Virginia. I'm suggesting that the Wikipedia Zero IP ranges that provide XFFs get added to that list, so that there's only one thing to maintain and only one thing to check.
On 22 January 2015 at 09:39, Federico Leva (Nemo) nemowiki@gmail.com wrote:
> Oliver Keyes, 21/01/2015 02:17:
>> I'd actually suggest just moving them to the site-wide config files we use for all other recognised proxies.
> If they aren't specific to Wikimedia projects, there is no reason for them to be in WMF configuration. On the contrary, there are ample precedents for such things being managed on Meta-Wiki (spam blacklist, titleblacklist, interwiki map etc.), even for extensions bundled with core. On the other hand, it would clearly be better to outsource the maintenance of such lists to a generic provider, as we did with TorBlock.
> Nemo
Please keep in mind that the originating client IP in the XFF header may sometimes be non-routable (10.* or 192.168.*).
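One possible way to handle that case, sketched in Python under a loud assumption: when the resolved originating address is non-routable, geolocate the hop that reported it instead. That fallback policy is this sketch's own choice, not something stated in the thread, and `geolocation_candidate` and both parameter names are hypothetical.

```python
import ipaddress


def geolocation_candidate(resolved_ip, reporting_hop):
    """Pick the address to geolocate.

    resolved_ip is the originating client address taken from XFF;
    reporting_hop is the proxy that forwarded it. Assumed policy:
    a non-routable client address cannot be geolocated, so use the
    reporting hop as the closest routable stand-in.
    """
    ip = ipaddress.ip_address(resolved_ip)
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return reporting_hop
    return resolved_ip
```

Whether the reporting proxy is a good enough stand-in depends on the use case; a carrier gateway may sit far from the actual user.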
On Thu, Jan 22, 2015 at 8:35 AM, Oliver Keyes okeyes@wikimedia.org wrote:
> I don't grok that message, I'm afraid. There's a config list of "these IPs are IPs we accept as trusted XFF providers"; it's used in CheckUser and similar tools to avoid ending up with everyone geolocating to Virginia. I'm suggesting that the Wikipedia Zero IP ranges that provide XFFs get added to that list, so that there's only one thing to maintain and only one thing to check.
-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
> IF proxy_ip does not start with "127.0" or "192.168" or "10." or "169.254":
There are other special/private IP blocks that you'll probably want filtered.
This RFC contains the full list for IPv4: https://tools.ietf.org/html/rfc5735#section-4
And the equivalent for IPv6: https://tools.ietf.org/html/rfc5156
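As a rough illustration, Python's standard `ipaddress` module already classifies most of these special-purpose blocks, which avoids maintaining string prefixes by hand. The helper name `is_special_purpose` is made up, and exactly which ranges each attribute covers depends on the Python version, so treat this as a sketch rather than a complete implementation of the RFC tables. (RFC 6890 has since obsoleted both RFC 5735 and RFC 5156.)

```python
import ipaddress


def is_special_purpose(ip_text):
    """Return True for private, loopback, link-local, multicast,
    reserved or unspecified addresses (IPv4 or IPv6).

    Raises ValueError if ip_text is not a parsable IP address.
    """
    ip = ipaddress.ip_address(ip_text.strip())
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

Unlike the four string prefixes, this catches 172.16.0.0/12, the rest of 127.0.0.0/8 (for example 127.1.2.3) and the IPv6 equivalents such as ::1 and fe80::/10.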
On Fri, Jan 23, 2015 at 11:29 PM, Gilles Dubuc gilles@wikimedia.org wrote:
>> IF proxy_ip does not start with "127.0" or "192.168" or "10." or "169.254":
> There are other special/private IP blocks that you'll probably want filtered.
> This RFC contains the full list for IPv4: https://tools.ietf.org/html/rfc5735#section-4
> And the equivalent for IPv6: https://tools.ietf.org/html/rfc5156
Thanks. This is a very useful reference.