Thanks for the detailed response.
That looks like (if X-Forwarded-For is “valid”) your jumping straight to considering the X-Forwarded-For header. You might want to check beforehand whether the client ip is a “trusted” proxy. And only if it is, you should walk the X-Forwarded-For.
Also, it sounds a bit like you're going through X-Forwarded-For from left to right. Make sure to walk it from right to left, as proxies are expected to append (not prepend) the client IP they forward for.
Okay. Assuming we append client_ip to X-Forwarded-For and process the list from right-to-left, do you still think we should explicitly check whether client_ip is a trusted proxy? Because, per the modified algorithm, the client_ip will be the first item to be processed and if it is not a trusted proxy, it will be returned as the result - which is what even the explicit initial check will result in this case as well. Correct?
The way you combine the looping and if looks like you might be skipping across entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't trust.
(X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not just a random string. If we find invalid entries, then the only option we have is to continue the search further for a valid IP.
list_of_ips = append client_ip to X-Forwarded-For for ip in reverse( list_of_ips ) do if ip is not a trusted proxy or the iterator does not have more
elements
then return ip fi
done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you rather geolocate wrong than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
What would picking the last known good IP address in this context? An IP that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
Different stakeholders at the WMF use different lists of what kind of proxies they consider. So things are not really clear cut.
Compiling a common list - may be union of "General", "MediaWiki" and "Wikipedia Zero" - would be very useful. As Oliver suggests, it would be good to have them in site-wide config files.
Thanks, Ananth
Hi Ananth,
On Thu, Jan 22, 2015 at 06:38:31PM +0530, Ananth RK wrote:
That looks like (if X-Forwarded-For is “valid”) your jumping straight to considering the X-Forwarded-For header. You might want to check beforehand whether the client ip is a “trusted” proxy. And only if it is, you should walk the X-Forwarded-For.
Also, it sounds a bit like you're going through X-Forwarded-For from left to right. Make sure to walk it from right to left, as proxies are expected to append (not prepend) the client IP they forward for.
Okay. Assuming we append client_ip to X-Forwarded-For and process the list from right-to-left, do you still think we should explicitly check whether client_ip is a trusted proxy? Because, per the modified algorithm, the client_ip will be the first item to be processed and if it is not a trusted proxy, it will be returned as the result - which is what even the explicit initial check will result in this case as well. Correct?
Right, if you implement the pseudo-code algorithm I sketched, you can skip that step.
But if you optimize and loop-unroll the first iteration (like your implementation did), you want to check whether or not it's a trusted proxy. But whether one would label that loop-unrolling “premature optimization” or “making the code more readable” depends a bit on the concrete formulation in code.
The way you combine the looping and if looks like you might be skipping across entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't trust.
(X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not just a random string.
If we find invalid entries, then the only option we have is to continue the search further for a valid IP.
Ex falso quodlibet.
Hence, the only option one has is to stop at untrusted values in the chain of X-Forwarded-For IPs.
list_of_ips = append client_ip to X-Forwarded-For for ip in reverse( list_of_ips ) do if ip is not a trusted proxy or the iterator does not have more
elements
then return ip fi
done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you rather geolocate wrong than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
What would picking the last known good IP address in this context? An IP that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
As indicated above, I'd rather indicate “Unable to geolocate” than knowingly geolocate wrong.
But by the “last known good IP address” above I meant (in case of faulty X-Forwarded-For), the last IP address that was not obviously wrong could be picked.
Let's have some examples. For the examples, I am assuming that only 1.0.0.0/8 are trusted proxies.
* Case 1:
Client IP: 2.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3
-> last known good IP address: “2.0.0.1”
* Case 2:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 2.0.0.1
-> last known good IP address: “2.0.0.1”
* Case 3:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 2.0.0.1
-> last known good IP address: “2.0.0.1”
* Case 4:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3, foo
-> last known good IP address: “1.0.0.1”
* Case 5:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, foo, 1.0.0.3
-> last known good IP address: “1.0.0.3”
* Case 6:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 1.0.0.2, 1.0.0.3
-> last known good IP address: “1.0.0.2”
* Case 7:
Client IP: foo
-> This should never happen. File a bug in phabricator.
But let me repeat that I think “Unable to geolocate” is a valid and good response. Better than using some proxy in the chain.
Have fun, Christian
Hi Christian,
On Thu, Jan 22, 2015 at 10:48 PM, christian@quelltextlich.at wrote:
Okay. Assuming we append client_ip to X-Forwarded-For and process the
list
from right-to-left, do you still think we should explicitly check whether client_ip is a trusted proxy? Because, per the modified algorithm, the client_ip will be the first item to be processed and if it is not a
trusted
proxy, it will be returned as the result - which is what even the
explicit
initial check will result in this case as well. Correct?
Right, if you implement the pseudo-code algorithm I sketched, you can skip that step.
But if you optimize and loop-unroll the first iteration (like your implementation did), you want to check whether or not it's a trusted proxy. But whether one would label that loop-unrolling “premature optimization” or “making the code more readable” depends a bit on the concrete formulation in code.
Okay.
The way you combine the looping and if looks like you might be skipping across entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't
trust.
(X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not
just
a random string.
If we find invalid entries, then the only option we have is to continue the search further for a valid IP.
Ex falso quodlibet.
(Oh Latin! How much I missed you!)
Hence, the only option one has is to stop at untrusted values in the chain of X-Forwarded-For IPs.
And pick the last known good IP?
list_of_ips = append client_ip to X-Forwarded-For for ip in reverse( list_of_ips ) do if ip is not a trusted proxy or the iterator does not have more
elements
then return ip fi
done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you rather geolocate wrong than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
What would picking the last known good IP address in this context? An IP that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
As indicated above, I'd rather indicate “Unable to geolocate” than knowingly geolocate wrong.
But by the “last known good IP address” above I meant (in case of faulty X-Forwarded-For), the last IP address that was not obviously wrong could be picked.
Let's have some examples. For the examples, I am assuming that only 1.0.0.0/8 are trusted proxies.
Case 1:
Client IP: 2.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3
-> last known good IP address: “2.0.0.1”
Case 2:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 2.0.0.1
-> last known good IP address: “2.0.0.1”
Case 3:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 2.0.0.1
-> last known good IP address: “2.0.0.1”
Case 4:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3, foo
-> last known good IP address: “1.0.0.1”
Case 5:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, foo, 1.0.0.3
-> last known good IP address: “1.0.0.3”
Case 6:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 1.0.0.2, 1.0.0.3
-> last known good IP address: “1.0.0.2”
Case 7:
Client IP: foo
-> This should never happen. File a bug in phabricator.
Thanks for the case-by-case explanation. It is *really* clear.
But let me repeat that I think “Unable to geolocate” is a valid and good response. Better than using some proxy in the chain.
Hm...Let me check with Andrew on whether it is an acceptable behavior for the UDF.
Thanks, Ananth
For Zero, we do a similar XFF decoding here:
https://github.com/wikimedia/operations-puppet/blob/production/templates/var...
On Fri, Jan 23, 2015 at 4:15 AM, Ananth RK ananthrk@ymxdata.com wrote:
Hi Christian,
On Thu, Jan 22, 2015 at 10:48 PM, christian@quelltextlich.at wrote:
Okay. Assuming we append client_ip to X-Forwarded-For and process the
list
from right-to-left, do you still think we should explicitly check
whether
client_ip is a trusted proxy? Because, per the modified algorithm, the client_ip will be the first item to be processed and if it is not a
trusted
proxy, it will be returned as the result - which is what even the
explicit
initial check will result in this case as well. Correct?
Right, if you implement the pseudo-code algorithm I sketched, you can skip that step.
But if you optimize and loop-unroll the first iteration (like your implementation did), you want to check whether or not it's a trusted proxy. But whether one would label that loop-unrolling “premature optimization” or “making the code more readable” depends a bit on the concrete formulation in code.
Okay.
The way you combine the looping and if looks like you might be skipping across entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't
trust.
(X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not
just
a random string.
If we find invalid entries, then the only option we have is to continue the search further for a valid IP.
Ex falso quodlibet.
(Oh Latin! How much I missed you!)
Hence, the only option one has is to stop at untrusted values in the chain of X-Forwarded-For IPs.
And pick the last known good IP?
list_of_ips = append client_ip to X-Forwarded-For for ip in reverse( list_of_ips ) do if ip is not a trusted proxy or the iterator does not have more
elements
then return ip fi
done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you rather geolocate wrong than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
What would picking the last known good IP address in this context? An IP that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
As indicated above, I'd rather indicate “Unable to geolocate” than knowingly geolocate wrong.
But by the “last known good IP address” above I meant (in case of faulty X-Forwarded-For), the last IP address that was not obviously wrong could be picked.
Let's have some examples. For the examples, I am assuming that only 1.0.0.0/8 are trusted proxies.
Case 1:
Client IP: 2.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3
-> last known good IP address: “2.0.0.1”
Case 2:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 2.0.0.1
-> last known good IP address: “2.0.0.1”
Case 3:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 2.0.0.1
-> last known good IP address: “2.0.0.1”
Case 4:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3, foo
-> last known good IP address: “1.0.0.1”
Case 5:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, foo, 1.0.0.3
-> last known good IP address: “1.0.0.3”
Case 6:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 1.0.0.2, 1.0.0.3
-> last known good IP address: “1.0.0.2”
Case 7:
Client IP: foo
-> This should never happen. File a bug in phabricator.
Thanks for the case-by-case explanation. It is *really* clear.
But let me repeat that I think “Unable to geolocate” is a valid and good response. Better than using some proxy in the chain.
Hm...Let me check with Andrew on whether it is an acceptable behavior for the UDF.
Thanks, Ananth
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hm...Let me check with Andrew on whether it is an acceptable behavior for the UDF.
Totally fine. Thanks!
On Jan 23, 2015, at 04:15, Ananth RK ananthrk@ymxdata.com wrote:
Hi Christian,
On Thu, Jan 22, 2015 at 10:48 PM, <christian@quelltextlich.at mailto:christian@quelltextlich.at> wrote:
Okay. Assuming we append client_ip to X-Forwarded-For and process the list from right-to-left, do you still think we should explicitly check whether client_ip is a trusted proxy? Because, per the modified algorithm, the client_ip will be the first item to be processed and if it is not a trusted proxy, it will be returned as the result - which is what even the explicit initial check will result in this case as well. Correct?
Right, if you implement the pseudo-code algorithm I sketched, you can skip that step.
But if you optimize and loop-unroll the first iteration (like your implementation did), you want to check whether or not it's a trusted proxy. But whether one would label that loop-unrolling “premature optimization” or “making the code more readable” depends a bit on the concrete formulation in code.
Okay.
The way you combine the looping and if looks like you might be skipping across entries that you cannot parse. But if you find invalid entries while backtracking the IPs, you're probably in parts of the X-Forwarded-For value that you shouldn't trust.
(X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not just a random string.
If we find invalid entries, then the only option we have is to continue the search further for a valid IP.
Ex falso quodlibet.
(Oh Latin! How much I missed you!)
Hence, the only option one has is to stop at untrusted values in the chain of X-Forwarded-For IPs.
And pick the last known good IP?
list_of_ips = append client_ip to X-Forwarded-For for ip in reverse( list_of_ips ) do if ip is not a trusted proxy or the iterator does not have more
elements
then return ip fi
done
(That should give invalid IP addresses for some requests. That seems to be the correct thing from my point of view. But if you rather geolocate wrong than not geolocate at all, you can choose to pick the last known good IP address instead. I guess that's more a matter of taste.)
What would picking the last known good IP address in this context? An IP that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
As indicated above, I'd rather indicate “Unable to geolocate” than knowingly geolocate wrong.
But by the “last known good IP address” above I meant (in case of faulty X-Forwarded-For), the last IP address that was not obviously wrong could be picked.
Let's have some examples. For the examples, I am assuming that only 1.0.0.0/8 http://1.0.0.0/8 are trusted proxies.
Case 1:
Client IP: 2.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3
-> last known good IP address: “2.0.0.1”
Case 2:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 2.0.0.1
-> last known good IP address: “2.0.0.1”
Case 3:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 2.0.0.1
-> last known good IP address: “2.0.0.1”
Case 4:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, 1.0.0.3, foo
-> last known good IP address: “1.0.0.1”
Case 5:
Client IP: 1.0.0.1 X-Forwarded-For: 1.0.0.2, foo, 1.0.0.3
-> last known good IP address: “1.0.0.3”
Case 6:
Client IP: 1.0.0.1 X-Forwarded-For: foo, 1.0.0.2, 1.0.0.3
-> last known good IP address: “1.0.0.2”
Case 7:
Client IP: foo
-> This should never happen. File a bug in phabricator.
Thanks for the case-by-case explanation. It is *really* clear.
But let me repeat that I think “Unable to geolocate” is a valid and good response. Better than using some proxy in the chain.
Hm...Let me check with Andrew on whether it is an acceptable behavior for the UDF.
Thanks, Ananth
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics