We really ought to move proxy ip ranges to meta or some other wiki. Suggestions?

On Tue, Jan 20, 2015 at 4:44 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi,

[ I discussed this issue with Ironholds some weeks back.
  Also he reached out to Ops about this some more weeks back.
  So I guess he can chime in with more details.
  Only chiming it with my half-knowledge, as kevinator asked me to ]


Warning: Longish-email :-/


On Tue, Jan 20, 2015 at 10:33:01PM +0530, Ananth RK wrote:
> IF X-Forwarded-For value is not valid:
>     return remote_host
> ELSE:
>     FOR EACH valid IP address, proxy_ip, in the comma-separated
> X-Forwarded-For value:

That looks like (if X-Forwarded-For is “valid”) your jumping straight
to considering the X-Forwarded-For header. You might want to check
beforehand whether the client ip is a “trusted” proxy. And only if it
is, you should walk the X-Forwarded-For.

Also, it sounds a bit like you're going through X-Forwarded-For from
left to right. Make sure to walk it from right to left, as proxies are
expected to append (not prepend) the client IP they forward for.



>     IF proxy_ip does not start with "127.0" or "192.168" or "10."
> or "169.254":
>         return proxy_ip

The way you combine the looping and if looks like you might be
skipping across entries that you cannot parse.
But if you find invalid entries while backtracking the IPs, you're
probably in parts of the X-Forwarded-For value that you shouldn't trust.

(X-Forwarded-For can easily be spoofed by clients)



> What are the ways in which this naive algorithm can be improved?

Handwavy pseudo-code only to illustrate the general pattern:

  list_of_ips = append client_ip to X-Forwarded-For
  for ip in reverse( list_of_ips )
  do
      if ip is not a trusted proxy or the iterator does not have more elements
      then
          return ip
      fi
  done

(That should give invalid IP addresses for some requests. That
seems to be the correct thing from my point of view. But if you rather
geolocate wrong than not geolocate at all, you can choose to pick the
last known good IP address instead. I guess that's more a matter of
taste.)



> For
> example, is it better to maintain a separate list of IP address to ignore
> (currently only 4)? If yes, how do we ensure that the list is exhaustive?
> Any other improvements?

Different stakeholders at the WMF use different lists of what kind of
proxies they consider. So things are not really clear cut.

(Do not treat the below list as authorative or complete. It's merely a
collection of the pieces I know. And I did not check since quite some
time. So it might be horribly outdated.)



* General

In general, I'd say one can assume that WMF-servers do not append
bogus data to X-Forwarded-For. So when backtracking IPs, you can
consider the hosts in $all_networks from
  https://git.wikimedia.org/blob/operations%2Fpuppet.git/9f97e3c2c5bc012ba5c3751f13fd838a06d6528d/manifests%2Fnetwork.pp#L14
as proxy, if they claim to have forwarded the request.

One might handle Labs special in there. Not sure.



* MediaWiki

MediaWiki considers $wgSquidServersNoPurge
  https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/af1cc64bb5521d3c158a779e294f67d5841a9a0f/wmf-config%2Fsquid.php#L12
and
  https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FTrustedXFF/ec180fbd7509330972bb1319ec9b9dd834b26266/trusted-hosts.txt
as proxies.



* Wikipedia Zero

Wikipedia Zero does not follow the MediaWiki pattern when tagging, but
is stripping WMF servers and considers sets of IP addresses from

  https://zero.wikimedia.org/wiki/Zero:-NOKIAPROD
  https://zero.wikimedia.org/wiki/Zero:-NOKIAQA
  https://zero.wikimedia.org/wiki/Zero:-OPERA

(that's a private wiki. :-/ ) as proxies.



* Legacy analytics software

Legacy analytics software often used hardcoded lists of IP addresses
they considered proxies.
Those lists became stale quickly.
I'd just ignore legacy software and not model what they modeled.

------------------------------------------------------------

If you want for a simple solution that at least allows determining the
IP that made the request to the WMF servers, and you do not need to
model a behaviour of some given component, go for “General”.

If you want to mimic MediaWiki as closely as possible, go for the
“MediaWiki” item.

If you care more about Wikipedia Zero, and carrier/geotagging there,
go for the “Wikipedia Zero” item.

If you are looking for a general purpose, versatile approach, add a
parameter that allows to select between the different strategies to
resolve X-Forwarded-For.

Yes, that sucks :-/

Have fun,
Christian



P.S.: Note that the repositories that the above links point to are not
fully static. While their proxy configuration does not change every
other day, they do change every now and then.



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christian@quelltextlich.at
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics