On Fri, Jan 23, 2015 at 4:15 AM, Ananth RK <ananthrk@ymxdata.com> wrote:

Hi Christian,

On Thu, Jan 22, 2015 at 10:48 PM, <christian@quelltextlich.at> wrote:

> Okay. Assuming we append client_ip to X-Forwarded-For and process the list
> from right-to-left, do you still think we should explicitly check whether
> client_ip is a trusted proxy? Because, per the modified algorithm, the
> client_ip will be the first item to be processed and if it is not a trusted
> proxy, it will be returned as the result - which is what even the explicit
> initial check will result in this case as well. Correct?

Right, if you implement the pseudo-code algorithm I sketched, you can
skip that step.

But if you optimize and loop-unroll the first iteration (like your
implementation did), you want to check whether or not it's a trusted
proxy.
But whether one would label that loop-unrolling “premature
optimization” or “making the code more readable” depends a bit on the
concrete formulation in code.

Okay.

> > The way you combine the looping and if looks like you might be
> > skipping across entries that you cannot parse.
> > But if you find invalid entries while backtracking the IPs, you're
> > probably in parts of the X-Forwarded-For value that you shouldn't trust.
>
> > (X-Forwarded-For can easily be spoofed by clients)
>
> We only check that each entry is a valid IPv4 or IPv6 address and not just
> a random string.

> If we find invalid entries, then the only option we have
> is to continue the search further for a valid IP.

Ex falso quodlibet.

(Oh Latin! How much I missed you!)

Hence, the only option one has is to stop at untrusted values in the
chain of X-Forwarded-For IPs.

And pick the last known good IP?

> > list_of_ips = append client_ip to X-Forwarded-For
> > for ip in reverse( list_of_ips )
> > do
> > if ip is not a trusted proxy or the iterator does not have more
> elements
> > then
> > return ip
> > fi
> > done
>
> > (That should give invalid IP addresses for some requests. That
> > seems to be the correct thing from my point of view. But if you rather
> > geolocate wrong than not geolocate at all, you can choose to pick the
> > last known good IP address instead. I guess that's more a matter of
> > taste.)
>
> What would picking the last known good IP address in this context? An IP
> that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?

As indicated above, I'd rather indicate “Unable to geolocate” than
knowingly geolocate wrong.

But by the “last known good IP address” above I meant (in case of
faulty X-Forwarded-For), the last IP address that was not obviously
wrong could be picked.

Let's have some examples.
For the examples, I am assuming that only 1.0.0.0/8 are trusted
proxies.

* Case 1:

Client IP: 2.0.0.1
X-Forwarded-For: 1.0.0.2, 1.0.0.3

-> last known good IP address: “2.0.0.1”

* Case 2:

Client IP: 1.0.0.1
X-Forwarded-For: 1.0.0.2, 2.0.0.1

-> last known good IP address: “2.0.0.1”

* Case 3:

Client IP: 1.0.0.1
X-Forwarded-For: foo, 2.0.0.1

-> last known good IP address: “2.0.0.1”

* Case 4:

Client IP: 1.0.0.1
X-Forwarded-For: 1.0.0.2, 1.0.0.3, foo

-> last known good IP address: “1.0.0.1”

* Case 5:

Client IP: 1.0.0.1
X-Forwarded-For: 1.0.0.2, foo, 1.0.0.3

-> last known good IP address: “1.0.0.3”

* Case 6:

Client IP: 1.0.0.1
X-Forwarded-For: foo, 1.0.0.2, 1.0.0.3

-> last known good IP address: “1.0.0.2”

* Case 7:

Client IP: foo

-> This should never happen. File a bug in phabricator.

Thanks for the case-by-case explanation. It is *really* clear.

But let me repeat that I think “Unable to geolocate” is a valid and
good response. Better than using some proxy in the chain.

Hm...Let me check with Andrew on whether it is an acceptable behavior for the UDF.

Thanks,
Ananth

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics