Another topic at MW Summit was using sql_mode to avoid some of MySQL's odd
legacy behaviour. They are:
You can set sql_mode per client connection on analytics-store without
affecting anyone else or replication.
Yesterday at MW Summit there were some non-MariaDB options tossed aorund.
More Hadoop usage than current efforts.
Druid http://druid.io/ (Dan)
RethinkDB http://rethinkdb.com/ (Ori)
Hyperdex http://hyperdex.org/ (Sean)
TokuMX http://www.tokutek.com/tokumx-for-mongodb/ (Sean)
No data yet on whether any or all are useful or not. Druid is probably the
most established of the bunch. TokuMX is a MongoDB clone using TokuDB under
the hood, the tech we already use to make analytics-store not explode
replicating all shards.
If anyone is keen to experiment I think they should speak up and ask for
hardware. Ops has some misc boxes around, plus a number of ex-MariaDB Dell
R510's that might suit experiementation.
Yesterday at the MW Summit I mentioned to Dan/Nuria/Halfak that
analytics-store disk was ~20% used. JFTR that was wrong; it's actually
closer to 45% used.
It's a 6T RAID10 10K array holding S1-7, Eventlogging, Staging, and the
Data Warehouse test schema. Still enough space for a good while at current
growth rate, but we'll want to review it periodically.
We currently have a method in Geo UDF that takes an IP address from the
remote host header and the X-Forwarded-For value and attempts to identify
the originating client IP address by following a simple algorithm as
IF X-Forwarded-For value is not valid:
FOR EACH valid IP address, proxy_ip, in the comma-separated
IF proxy_ip does not start with "127.0" or "192.168" or "10."
What are the ways in which this naive algorithm can be improved? For
example, is it better to maintain a separate list of IP address to ignore
(currently only 4)? If yes, how do we ensure that the list is exhaustive?
Any other improvements?
My name is Neta Livneh and I'm a FOSS OPW Intern, working with the language
engineering team on estimating how many translated pages there are in
For my project, we will need to sql queries on current wikipedia data
(mostly revision history table).
I already have a Gerrit account. Can I get SSH access for running such
Thanks for the detailed response.
> That looks like (if X-Forwarded-For is “valid”) your jumping straight
> to considering the X-Forwarded-For header. You might want to check
> beforehand whether the client ip is a “trusted” proxy. And only if it
> is, you should walk the X-Forwarded-For.
> Also, it sounds a bit like you're going through X-Forwarded-For from
> left to right. Make sure to walk it from right to left, as proxies are
> expected to append (not prepend) the client IP they forward for.
Okay. Assuming we append client_ip to X-Forwarded-For and process the list
from right-to-left, do you still think we should explicitly check whether
client_ip is a trusted proxy? Because, per the modified algorithm, the
client_ip will be the first item to be processed and if it is not a trusted
proxy, it will be returned as the result - which is what even the explicit
initial check will result in this case as well. Correct?
> The way you combine the looping and if looks like you might be
> skipping across entries that you cannot parse.
> But if you find invalid entries while backtracking the IPs, you're
> probably in parts of the X-Forwarded-For value that you shouldn't trust.
> (X-Forwarded-For can easily be spoofed by clients)
We only check that each entry is a valid IPv4 or IPv6 address and not just
a random string. If we find invalid entries, then the only option we have
is to continue the search further for a valid IP.
> list_of_ips = append client_ip to X-Forwarded-For
> for ip in reverse( list_of_ips )
> if ip is not a trusted proxy or the iterator does not have more
> return ip
> (That should give invalid IP addresses for some requests. That
> seems to be the correct thing from my point of view. But if you rather
> geolocate wrong than not geolocate at all, you can choose to pick the
> last known good IP address instead. I guess that's more a matter of
What would picking the last known good IP address in this context? An IP
that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy?
> Different stakeholders at the WMF use different lists of what kind of
> proxies they consider. So things are not really clear cut.
Compiling a common list - may be union of "General", "MediaWiki" and
"Wikipedia Zero" - would be very useful. As Oliver suggests, it would be
good to have them in site-wide config files.
+ product, analytics as fyi
On Thu, Jan 15, 2015 at 3:27 PM, Jon Katz <jkatz(a)wikimedia.org> wrote:
> Hi Folks,
> Just sharing some data gleaned from the logs that I pulled in relation to
> the collections project. I looked at clicks on the watchlist star (see
> image), because we are considering that to be the first entry point for
> creating a collection from an article on mobile web. Nothing
> earth-shattering, but the data suggests that the pilot will not generate
> enough data without:
> - a tutorial element/signage
> - improved account creation funnel
> - additional entry points (such as search results)
> [image: Inline image 1][image: Inline image 4]
> *Summary: *
> - Data suggests additional entry points might be necessary in order to
> derive meaningful results from the pilot.
> - Current expected user population reaching collections entry point
> without further intervention 2.4M per month. Of those, 28.6k are logged in
> and able to access the feature funnel.
> - Another 8.5k users end up signing up as a result of the prompt
> (.3% conversion rate).
> - Baseline established to measure interventions against.
> - More baselines to come.
> - in a 30 day period in April, ~28.6k logged in users clicked on
> watchlist star (2.4M anonymous clicked, prompted to create account)
> - in a 30 day period in June*, ~8.5k of the previously logged-out
> users successfully created accounts after clicking on watchlist star (.3%
> conversion rate). I do not know if these should be counted in the 28.6k
> logged in "watch" events.
> - Logins that resulted from CTA not tracked, but assumed to be low and
> might trigger the measured events we did track.
> - Chart view, clicks on watchlist star (see methods>assumptions,
> below, for anonymous user calc):
> [image: Inline image 2]
> *How this informs Collections:*
> - If nothing changes, ~2.4M people have potential to see watchlist
> creation prompt, of whom 28.6k are immediately eligible to create (don't
> need login/account?). I think this latter number is a little low for a
> pilot. You? Should we explore other entry points for pilot, like search
> - We have a baseline of ~81.5k watchlist clicks per month by logged in
> users and 2.8M from loggedout. How much can we increase these numbers? I
> don't think this number *needs* to grow significantly for success,
> because 1 collection could drive infinite views, but it would definitely
> - Any overlay or product introduction will need to increase this
> number to be successful
> - Changing the icon from a star to a heart or some thing else more
> collection-y would also have to increase this number to be succesful
> - Account creation rate has room to grow
> - The CTA has a .3% conversion rate, which is very very low. If we
> can boost this, we might be able to have a large impact on the number of
> people who are eligible to create.
> - Can we drive this number up by offering collections as an
> *Next Steps*
> - We need to explore alternate entry points - such as search, as we
> continue to build out the MVP.
> - I think we should measure the following baselines to establish what
> we should measure against and set realistic expectations:
> - Book creations over time (to compare to collection creations)
> - Edits on mobile (to compare to collection creations/edits)
> - Watchlist menu clicks (similar path to editing your collections)
> - Referrals via social (to establish baseline for social/share
> - Any others?
> - Queries used:
> - Queried MobileBetaWatchlist_5281061 for 30 days--tracks clicks on
> star (April 2014). Link to query
> - Queried ServerSideAccountCreation_5487345 for 30 days in (June
> 2014) Link to query
> - Assumptions
> - Assume unique [IP+UserAgent] = unique anonymous user. Definitely
> undercounts users this way.
> - Assume number of users who login as a result of clicking on
> watchlist is small (not currently measured) and/or succesful logins from
> this prompt sucesfully add article to watchlist and trigger event that we
> *did* measure. So the above count might be underestimating total
> - Assuming addition is appropriate between the logged in watchlist
> clicks and the account creations. I ran a join to see if a succesful
> account creation then triggered a successful version of the original
> watchlist event (i.e. anon click to watch article-->account signup
> prompt-->create account-->article watch click triggered), but it kept
> stalling out. This is the query, in case you're curious
> - Data alert
> - I had a rough time with some queries because the event logs had
> events from Jan- Dec, but there were obvious dropoffs and spikes due to
> changes that were only caught when I ran a daily report for events in each
> table. Without this the data would have been very very skewed (my original
> count for anon watchlist clicks was 57k per month...it is actually 2.4M). Here
> is a link
> to a helpful daily event query.
> *necessary to use different time periods because the logs did not cleanly