Analytics January 2015

analytics@lists.wikimedia.org

48 participants
47 discussions

Early registration for CSCW 2015 ends January 30th
by Dario Taraborelli 27 Jan '15

27 Jan '15

For those of you interested in attending, the early registration deadline is January 30. See also https://meta.wikimedia.org/wiki/Research:CSCW_2015 <https://meta.wikimedia.org/wiki/Research:CSCW_2015> — — — CSCW 2015 | March 14-18 | Vancouver, BC, Canada http://cscw.acm.org <http://cscw.acm.org/> * Early registration ends January 30th. * Advance program is available at http://cscw.acm.org/2015/program/ <http://cscw.acm.org/2015/program/> * Conference hotel rooms are already selling out at $135/night The 18th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW 2015) will be held March 14-18 in Vancouver, BC, Canada and is co-located with ACM Learning at Scale. CSCW is the premier venue for presenting research in the design and use of technologies that affect groups, organizations, communities, and networks. Bringing together top researchers and practitioners from academia and industry in the area of social computing, CSCW encompasses both the technical and social challenges encountered when supporting collaboration. Jeff Hancock from Cornell University will give the opening keynote address discussing "The Facebook Study: A Personal Account of Data Science, Ethics and Change." The closing keynote speaker will be Zeynep Tufekci from University of North Carolina Chapel Hill, speaking on "Algorithms in our Midst: Information, Power and Choice when Software is Everywhere." We are also pleased to announce that Wanda Orlikowski will receive the CSCW 2015 Lasting Impact Award and present a retrospective on her groundbreaking 1992 paper, "Learning from Notes: organizational issues in groupware implementation." Registration is available at: http://cscw.acm.org/2015/attend/registration.php <http://cscw.acm.org/2015/attend/registration.php> Registration questions? Ask Yvonne Lopez Hotel reservations can be made at: http://cscw.acm.org/2015/attend/hotel.php <http://cscw.acm.org/2015/attend/hotel.php> Use #CSCW2015 and follow us at http://twitter.com/ACM_CSCW <http://twitter.com/ACM_CSCW> or http://www.facebook.com/acmCSCW <http://www.facebook.com/acmCSCW> for updates. Conference Co-chairs Dan Cosley, Cornell University Andrea Forte, Drexel University chairs2015(a)cscw.acm.org <mailto:chairs2015@cscw.acm.org>

1 0

MariaDB sql_mode
by Sean Pringle 27 Jan '15

27 Jan '15

Another topic at MW Summit was using sql_mode to avoid some of MySQL's odd legacy behaviour. They are: https://mariadb.com/kb/en/mariadb/sql_mode/ You can set sql_mode per client connection on analytics-store without affecting anyone else or replication.

1 0

Non-MariaDB options
by Sean Pringle 27 Jan '15

27 Jan '15

Yesterday at MW Summit there were some non-MariaDB options tossed aorund. They were: More Hadoop usage than current efforts. Druid http://druid.io/ (Dan) RethinkDB http://rethinkdb.com/ (Ori) Hyperdex http://hyperdex.org/ (Sean) TokuMX http://www.tokutek.com/tokumx-for-mongodb/ (Sean) No data yet on whether any or all are useful or not. Druid is probably the most established of the bunch. TokuMX is a MongoDB clone using TokuDB under the hood, the tech we already use to make analytics-store not explode replicating all shards. If anyone is keen to experiment I think they should speak up and ask for hardware. Ops has some misc boxes around, plus a number of ex-MariaDB Dell R510's that might suit experiementation. BR Sean

2 1

analytics-store disk space
by Sean Pringle 27 Jan '15

27 Jan '15

Yesterday at the MW Summit I mentioned to Dan/Nuria/Halfak that analytics-store disk was ~20% used. JFTR that was wrong; it's actually closer to 45% used. It's a 6T RAID10 10K array holding S1-7, Eventlogging, Staging, and the Data Warehouse test schema. Still enough space for a good while at current growth rate, but we'll want to review it periodically. Sean

1 0

Better way to identify client IP address
by Ananth RK 27 Jan '15

27 Jan '15

We currently have a method in Geo UDF that takes an IP address from the remote host header and the X-Forwarded-For value and attempts to identify the originating client IP address by following a simple algorithm as follows: GetClientIP(remote_host, X-Forwarded-For) IF X-Forwarded-For value is not valid: return remote_host ELSE: FOR EACH valid IP address, proxy_ip, in the comma-separated X-Forwarded-For value: IF proxy_ip does not start with "127.0" or "192.168" or "10." or "169.254": return proxy_ip What are the ways in which this naive algorithm can be improved? For example, is it better to maintain a separate list of IP address to ignore (currently only 4)? If yes, how do we ensure that the list is exhaustive? Any other improvements? Thanks, Ananth

6 8

Getting Access to Wikipedia Database
by Neta Livneh 26 Jan '15

26 Jan '15

Hi everyone, My name is Neta Livneh and I'm a FOSS OPW Intern, working with the language engineering team on estimating how many translated pages there are in different wikipedias. For my project, we will need to sql queries on current wikipedia data (mostly revision history table). I already have a Gerrit account. Can I get SSH access for running such queries? Thanks, Neta

7 14

Re: [Analytics] u2opiamobile + WMF tech contacts
by Adam Baso 24 Jan '15

24 Jan '15

(message contents deleted)

1 0

Wikimedia referrer policy
by Dario Taraborelli 23 Jan '15

23 Jan '15

I’ve been discussing with the folks at CrossRef (the largest registry of Digital Object Identifiers, think of it as the ICANN of science) how to accurately measure the impact of traffic driven from Wikipedia/Wikimedia to scholarly resources. While digging into their data, we realized that since Wikimedia started the HTTPS switchover and an increasing portion of inbound traffic happens over SSL, Wikimedia sites may have stopped advertising themselves as sources of referred traffic to external sites. While this is a literal implication of HTTPS, it means that Wikimedia's impact on traffic directed to other sites is becoming largely invisible and Wikimedia might be turning into a large source of dark traffic. I wrote a proposal reviewing the CrossRef use case and discussing how other top web properties deal with this issue by adopting a so-called "Referrer Policy”: https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy <https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy> Feedback is welcome on the talk page: https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_referrer_policy <https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_referrer_policy> Dario

2 1

Re: [Analytics] Better way to identify client IP address
by Ananth RK 23 Jan '15

23 Jan '15

Thanks for the detailed response. > That looks like (if X-Forwarded-For is “valid”) your jumping straight > to considering the X-Forwarded-For header. You might want to check > beforehand whether the client ip is a “trusted” proxy. And only if it > is, you should walk the X-Forwarded-For. > Also, it sounds a bit like you're going through X-Forwarded-For from > left to right. Make sure to walk it from right to left, as proxies are > expected to append (not prepend) the client IP they forward for. Okay. Assuming we append client_ip to X-Forwarded-For and process the list from right-to-left, do you still think we should explicitly check whether client_ip is a trusted proxy? Because, per the modified algorithm, the client_ip will be the first item to be processed and if it is not a trusted proxy, it will be returned as the result - which is what even the explicit initial check will result in this case as well. Correct? > The way you combine the looping and if looks like you might be > skipping across entries that you cannot parse. > But if you find invalid entries while backtracking the IPs, you're > probably in parts of the X-Forwarded-For value that you shouldn't trust. > (X-Forwarded-For can easily be spoofed by clients) We only check that each entry is a valid IPv4 or IPv6 address and not just a random string. If we find invalid entries, then the only option we have is to continue the search further for a valid IP. > list_of_ips = append client_ip to X-Forwarded-For > for ip in reverse( list_of_ips ) > do > if ip is not a trusted proxy or the iterator does not have more elements > then > return ip > fi > done > (That should give invalid IP addresses for some requests. That > seems to be the correct thing from my point of view. But if you rather > geolocate wrong than not geolocate at all, you can choose to pick the > last known good IP address instead. I guess that's more a matter of > taste.) What would picking the last known good IP address in this context? An IP that matches the IPv4/IPv6 pattern _AND_ is not a trusted proxy? > Different stakeholders at the WMF use different lists of what kind of > proxies they consider. So things are not really clear cut. Compiling a common list - may be union of "General", "MediaWiki" and "Wikipedia Zero" - would be very useful. As Oliver suggests, it would be good to have them in site-wide config files. Thanks, Ananth

4 4

Re: [Analytics] collections entry points
by Jon Katz 22 Jan '15

22 Jan '15

+ product, analytics as fyi On Thu, Jan 15, 2015 at 3:27 PM, Jon Katz <jkatz(a)wikimedia.org> wrote: > Hi Folks, > Just sharing some data gleaned from the logs that I pulled in relation to > the collections project. I looked at clicks on the watchlist star (see > image), because we are considering that to be the first entry point for > creating a collection from an article on mobile web. Nothing > earth-shattering, but the data suggests that the pilot will not generate > enough data without: > > - a tutorial element/signage > - improved account creation funnel > - additional entry points (such as search results) > > -J > > > [image: Inline image 1][image: Inline image 4] > > *Summary: * > > - Data suggests additional entry points might be necessary in order to > derive meaningful results from the pilot. > - Current expected user population reaching collections entry point > without further intervention 2.4M per month. Of those, 28.6k are logged in > and able to access the feature funnel. > - Another 8.5k users end up signing up as a result of the prompt > (.3% conversion rate). > - Baseline established to measure interventions against. > - More baselines to come. > > > *Results* > > - in a 30 day period in April, ~28.6k logged in users clicked on > watchlist star (2.4M anonymous clicked, prompted to create account) > - in a 30 day period in June*, ~8.5k of the previously logged-out > users successfully created accounts after clicking on watchlist star (.3% > conversion rate). I do not know if these should be counted in the 28.6k > logged in "watch" events. > - Logins that resulted from CTA not tracked, but assumed to be low and > might trigger the measured events we did track. > - Chart view, clicks on watchlist star (see methods>assumptions, > below, for anonymous user calc): > > [image: Inline image 2] > > *How this informs Collections:* > > - If nothing changes, ~2.4M people have potential to see watchlist > creation prompt, of whom 28.6k are immediately eligible to create (don't > need login/account?). I think this latter number is a little low for a > pilot. You? Should we explore other entry points for pilot, like search > results? > - We have a baseline of ~81.5k watchlist clicks per month by logged in > users and 2.8M from loggedout. How much can we increase these numbers? I > don't think this number *needs* to grow significantly for success, > because 1 collection could drive infinite views, but it would definitely > help > - Any overlay or product introduction will need to increase this > number to be successful > - Changing the icon from a star to a heart or some thing else more > collection-y would also have to increase this number to be succesful > - Account creation rate has room to grow > - The CTA has a .3% conversion rate, which is very very low. If we > can boost this, we might be able to have a large impact on the number of > people who are eligible to create. > - Can we drive this number up by offering collections as an > incentive? > > *Next Steps* > > - We need to explore alternate entry points - such as search, as we > continue to build out the MVP. > - I think we should measure the following baselines to establish what > we should measure against and set realistic expectations: > - Book creations over time (to compare to collection creations) > - Edits on mobile (to compare to collection creations/edits) > - Watchlist menu clicks (similar path to editing your collections) > - Referrals via social (to establish baseline for social/share > traffic) > - Any others? > > *Methods:* > > - Queries used: > - Queried MobileBetaWatchlist_5281061 for 30 days--tracks clicks on > star (April 2014). Link to query > <https://docs.google.com/a/wikimedia.org/document/d/1-bfW9H2kApvOOOTKL40TjYK…> > . > - Queried ServerSideAccountCreation_5487345 for 30 days in (June > 2014) Link to query > <https://docs.google.com/a/wikimedia.org/document/d/1-bfW9H2kApvOOOTKL40TjYK…> > . > - Assumptions > - Assume unique [IP+UserAgent] = unique anonymous user. Definitely > undercounts users this way. > - Assume number of users who login as a result of clicking on > watchlist is small (not currently measured) and/or succesful logins from > this prompt sucesfully add article to watchlist and trigger event that we > *did* measure. So the above count might be underestimating total > events. > - Assuming addition is appropriate between the logged in watchlist > clicks and the account creations. I ran a join to see if a succesful > account creation then triggered a successful version of the original > watchlist event (i.e. anon click to watch article-->account signup > prompt-->create account-->article watch click triggered), but it kept > stalling out. This is the query, in case you're curious > <https://docs.google.com/a/wikimedia.org/document/d/1-bfW9H2kApvOOOTKL40TjYK…> > . > - Data alert > - I had a rough time with some queries because the event logs had > events from Jan- Dec, but there were obvious dropoffs and spikes due to > changes that were only caught when I ran a daily report for events in each > table. Without this the data would have been very very skewed (my original > count for anon watchlist clicks was 57k per month...it is actually 2.4M). Here > is a link > <https://docs.google.com/a/wikimedia.org/document/d/1-bfW9H2kApvOOOTKL40TjYK…> > to a helpful daily event query. > > > *necessary to use different time periods because the logs did not cleanly > overlap > > >

3 3

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics January 2015