Hi,
just a quick heads-up that Ops are about to add a “php” key to the
X-Analytics header (i.e., for sampled-1000 logs, Hive, ...):
https://gerrit.wikimedia.org/r/#/c/156793/
This key will indicate which PHP implementation served the request [1].
Planned deployment is between 2014-09-01 and 2014-09-02.
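For consumers of these logs: X-Analytics is a semicolon-separated list of key=value pairs, so a log-processing script could split it out roughly like this (a sketch; the sample header value below is illustrative, not real data):

```python
# Parse an X-Analytics header value ("key1=val1;key2=val2;...") into a
# dict. The sample value is made up for illustration.
def parse_x_analytics(value):
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, val = part.partition("=")
        fields[key] = val
    return fields

print(parse_x_analytics("php=hhvm;https=1"))  # {'php': 'hhvm', 'https': '1'}
```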
Have fun,
Christian
[1] https://wikitech.wikimedia.org/wiki/X-Analytics#Keys
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hello,
We kicked off our next sprint this morning, building on the release
planning done over the last two weeks. The sprint status is here:
http://sb.wmflabs.org/t/analytics-developers/2014-10-30/
The focus of this sprint is working on the backend in preparation to
display new data in Vital Signs.
Bug ID  Component     Summary                                                        Points
72740   Dashiki       Story: Vital Signs User selects the Daily Pageviews metrics        34
72741   EventLogging  List tables/schemas with data retention needs                       0
72642   EventLogging  Story: Identify and direct the purging of Event logging raw
                      logs older than 90 days in stat1002                                 0
67450   EventLogging  database consumer could batch inserts (sometimes)                  34
72746   Wikimetrics   Story: Wikimetrics User tags a cohort using a pre-defined tag       5
72635   Wikimetrics   report table performance, cleanup, and number of items             13
That’s 86 points in 4 stories.
The bugs with 0 points are tasks for the team to track and follow up on,
and the work mostly falls on other teams.
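For the curious, the batching idea behind bug 67450 ("database consumer could batch inserts") can be sketched as follows; the schema, table name, and batch size are made up for illustration and are not the actual EventLogging consumer code:

```python
import sqlite3

# Illustrative sketch of batching inserts instead of issuing one
# INSERT per event (the idea behind bug 67450). Uses an in-memory
# SQLite DB; the real consumer writes to MariaDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, payload TEXT)")

BATCH_SIZE = 100  # arbitrary choice for the sketch
buffer = []

def consume(event):
    """Buffer an event; flush to the DB once the batch is full."""
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Write any buffered events in a single batched INSERT."""
    if buffer:
        conn.executemany("INSERT INTO events VALUES (?, ?)", buffer)
        conn.commit()
        buffer.clear()

for i in range(250):
    consume(("2014-10-30T00:00:00", "event-%d" % i))
flush()  # flush the final partial batch

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 250
```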
Regards,
Kevin Leduc
Hello,
To comply with our privacy policy, we are going to purge logs on
stat1002 that are older than 90 days. Please let us know whether this
is an issue for you. We hope to have these changes done by the end of
next week.
A concrete example: logs in the eventlogging archive directory
(stat1002:/a/eventlogging/archive) will be restricted to the last 90
days.
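A retention sweep like this boils down to comparing file mtimes against a cutoff; a rough sketch (this is an illustration, not the actual purge script, and the commented-out path is just the example directory above):

```python
import os
import time

# Sketch of a 90-day retention sweep: list files whose modification
# time is past the retention window. Illustrative only.
RETENTION_DAYS = 90

def is_expired(mtime, now):
    """True if a file modified at `mtime` is past the retention window."""
    return mtime < now - RETENTION_DAYS * 86400

def files_to_purge(root):
    """Walk `root` and return paths of files older than the window."""
    now = time.time()
    return [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
        if is_expired(os.path.getmtime(os.path.join(dirpath, name)), now)
    ]

# Dry run first: print candidates instead of deleting them.
# for path in files_to_purge("/a/eventlogging/archive"):
#     print("would remove", path)
```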
Thanks,
Nuria
Hi,
just a quick heads up that the replication lag on
analytics-store.eqiad.wmnet (i.e., s5-analytics-slave, dbstore1002)
is currently at 17 hours and increasing.
I filed RT ticket 8788:
https://rt.wikimedia.org/Ticket/Display.html?id=8788
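If you want to keep an eye on the lag yourself, the number comes from the Seconds_Behind_Master field of MySQL's SHOW SLAVE STATUS; a small helper to make it readable (a sketch; running the query itself requires a normal MySQL connection to the replica):

```python
# Format MySQL's Seconds_Behind_Master (from SHOW SLAVE STATUS) as a
# human-readable lag. Seconds_Behind_Master is NULL when replication
# is not running.
def format_lag(seconds_behind_master):
    if seconds_behind_master is None:
        return "replication stopped"
    hours, rem = divmod(int(seconds_behind_master), 3600)
    minutes = rem // 60
    return "%dh %02dm" % (hours, minutes)

print(format_lag(17 * 3600))  # 17h 00m
```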
Best regards,
Christian
Hi,
in the week of 2014-10-20 to 2014-10-26, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Research on columnar storage in the cluster
* Research on how to count accesses to media files
* Rolling out ACK tuning for varnishkafka
* More work towards getting application id into logstash
(details below)
Have fun,
Christian
* Research on columnar storage in the cluster
Columnar storage engines can help speed up some of the queries we're
running and plan to run. So we did some more research around Parquet
and Avro, and how xmldumps imports could benefit from them.
* Research on how to count accesses to media files
We have had many requests to make access counts for media files
public. Since the basic infrastructural ingredients are within reach,
we started to explore what would be doable towards getting such data
public.
* Rolling out ACK tuning for varnishkafka
As reported for the previous week, the ACK tuning for varnishkafka
proved to avoid message loss during leader elections. So we are
incrementally deploying the new ACK parameter to the caches; 3 out of
4 clusters are already using it, and the deployment to the fourth
cluster is still pending.
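For reference, the knob in question is librdkafka's request.required.acks, which varnishkafka passes through from its configuration; the fragment below is an assumption about the deployed value, not a quote from the actual config:

```
# varnishkafka.conf fragment (illustrative):
# -1 = wait for all in-sync replicas to acknowledge a message
# before considering it delivered, so a leader election does not
# drop messages that only the old leader had seen.
kafka.request.required.acks = -1
```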
* More work towards getting application id into logstash
Repackaging jars to inject the log4j configurations allowed us to get
more logs into logstash. We are also starting to extract application
ids from log messages, which will finally make it possible to use
logstash to fetch and filter the logs for the applications (like Hive
queries) one is running on the cluster.
Hi,
in the week of 2014-10-13 to 2014-10-19, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Webstatscollector deployment (Bug 66352, Bug 71790)
* Testing potential kafkatee fix
* Analytics1021, its partition leader role, and missing data
* gp.wmflabs.org showing empty graphs
* Database lags
* Obtaining HTTPS numbers to assist around POODLE vulnerability
* Redeployment of some Hive scripts
* Preparations for ua_parser Hive UDF
(details below)
Have fun,
Christian
* Webstatscollector deployment (Bug 66352, Bug 71790)
As reported in previous weeks, new webstatscollector builds had been
prepared to stop counting requests to the “Undefined” page (Bug
66352) and to stop counting redirects twice (Bug 71790). Those new
builds have now been deployed to both webstatscollector pipelines.
* Testing potential kafkatee fix
From time to time, kafkatee did not consume from all relevant Kafka
partitions. The kafkatee maintainer provided a potential fix that has
been running on analytics1003 since then. The kafkatee-generated files
look good so far, but since the issue previously took some time to
manifest, the tests need to run a bit longer.
* Analytics1021, its partition leader role, and missing data
Analytics1021 again dropped out of its partition leader role. This is
the first time this has happened since the ACK parameters were tuned
on some machines, and the tuning proved worthwhile: the caches with
tuned ACK parameters did not see message loss.
Since the issue happened again later, and again exactly the machines
with tuned ACK parameters saw no message loss, we can prepare to roll
out the tuned ACK parameters more widely.
* gp.wmflabs.org showing empty graphs
In 2013, some graphs on gp.wmflabs.org were taken offline due to
privacy concerns. However, the main dashboard still referenced some of
those graphs and rendered them as empty. This made the dashboard
/look/ broken, although the public graphs rendered as expected. We
updated the dashboard to no longer reference offline graphs, so it no
longer looks broken.
* Database lags
Due to different, unrelated causes, some databases lagged considerably
during this week. Ops got the databases back to normal again.
* Obtaining HTTPS numbers to assist around POODLE vulnerability
In order to decide on how to address the POODLE vulnerability, Ops
needed numbers on usage of HTTPS for old browsers. Since this data is
not prepared automatically, we extracted the numbers from the logs.
* Redeployment of some Hive scripts
It seems an unannounced Friday deployment during the SF hackathon
angered the deployment gods and caused some Oozie/Hive jobs to stop
running correctly. So we had to fix the setup, resubmit the jobs, and
backfill the missing data. No data was lost.
* Preparations for ua_parser UDF
There is a push from several sides for a Hive UDF that can parse
User-Agents. A good part of the week was spent implementing and
reviewing this UDF, but it's not yet merged and will require a bit
more work.
Both of the presentations at the October Wikimedia Research Showcase were
fascinating and I encourage everyone to watch them [1]. I would like to
continue to discuss the themes from the showcase about Wikipedia's
adaptability, viability, and diversity.
Aaron's discussion about Wikipedia's ongoing internal adaptations, and
the slowing of those adaptations, reminded me of this statement from MIT
Technology Review in 2013 (and I recommend reading the whole article [2]):
"The main source of those problems (with Wikipedia) is not mysterious. The
loose collective running the site today, estimated to be 90 percent male,
operates a crushing bureaucracy with an often abrasive atmosphere that
deters newcomers who might increase participation in Wikipedia and broaden
its coverage."
I would like to contrast that vision of Wikipedia with the vision presented
by User:CatherineMunro (formatting tweaks by me), which I re-read when I
need encouragement:
"THIS IS AN ENCYCLOPEDIA
One gateway
to the wide garden of knowledge,
where lies
The deep rock of our past,
in which we must delve
The well of our future,
The clear water
we must leave untainted
for those who come after us,
The fertile earth,
in which truth may grow
in bright places,
tended by many hands,
And the broad fall of sunshine,
warming our first steps
toward knowing
how much we do not know."
How can we align ourselves less with the former vision and more with the
latter? [3]
I hope that we can continue to discuss these themes on the Research mailing
list. Please contribute your thoughts and questions there.
Regards,
Pine
[1] youtube.com/watch?v=-We4GZbH3Iw
[2]
http://www.technologyreview.com/featuredstory/520446/the-decline-of-wikiped…
[3] Lest this at first seem impossible, I will borrow and tweak a
quote from George Bernard Shaw that was later used by John F. Kennedy:
"Some people see things as they are and say, 'Why?' Let us dream things
that never were and say, 'Why not?'"
Forwarding comments from Wikimedia-l that may be of interest to a number of
subscribers on other lists.
Pine
---------- Forwarded message ----------
From: "Erik Moeller" <erik(a)wikimedia.org>
Date: Oct 25, 2014 5:59 PM
Subject: Re: [Wikimedia-l] Chapters and GLAM tooling
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Cc:
On Sat, Oct 25, 2014 at 7:16 AM, MZMcBride <z(a)mzmcbride.com> wrote:
> Labs is a playground and Galleries, Libraries, Archives, and Museums are
> serious enough to warrant a proper investment of resources, in my view.
> Magnus and many others develop magnificent tools, but my sense is that
> they're largely proofs of concept, not final implementations.
Far from being treated as mere proofs of concept, Magnus' GLAM tools
[1] have been used to measure and report success in the context of
project grant and annual plan proposals and reports, ongoing project
performance measurements, blog posts and press releases, etc. Daniel
Mietchen has, to my knowledge, been the main person doing any
systematic auditing or verification of the reports generated by these
tools, and results can be found in his tool testing reports, the last
one of which is unfortunately more than a year old. [2]
Integration with MediaWiki should IMO not be viewed as a runway that
all useful developments must be pushed towards. Rather, we should seek
to establish clearer criteria by which to decide that functionality
benefits from this level of integration, to such an extent that it
justifies the cost. Functionality that is not integrated in this
manner should, then, not be dismissed as "proofs of concept" but
rather judged on its own merits.
GWToolset [3] is a good example. It was built as a MediaWiki extension
to manage GLAM batch uploads, but we should not regard this decision
as sacrosanct, or the only correct way to develop this kind of
functionality. The functionality it provides is of highly specialized
interest, and indeed, the number of potential users to-date is 47
according to [4], most of whom have not performed significant uploads
yet. Its user interface is highly specialized and special permissions
+ detailed instructions are required to use it. At the same time, it
has been used to upload 322,911 files overall, an amazing number even
without going into the quality and value of the individual
collections.
So, why does it need to be a MediaWiki extension at all? When
development began in 2012, OAuth support in MediaWiki did not exist,
so it was impossible for an external tool (then running on toolserver)
to manage an upload on the user's behalf without asking for the user's
password, which would have been in violation of policy. But today, we
have other options. It's possible that storage requirements or other
specific desired integration points would make it impossible to create
this as a Tool Labs tool -- but if we created the same tool today, we
should carefully consider that.
Indeed, highly specialized tools for the cultural and education sector
_are_ being developed and hosted inside Tool Labs or externally.
Looking at the current OAuth consumer requests [5], there are
submissions for a metadata editor developed by librarians at the
University of Miami Libraries in Coral Gables, Florida, and an
assignment creation wizard developed by the Wiki Education Foundation.
There's nothing "improper" about that, as Marc-André pointed out.
As noted before, for tools like the ones used for GLAM reporting to
get better, WMF has its role to play in providing more datasets and
improved infrastructure. But there's nothing inherent in the
development of those tools that forces them to live in production
land, or that requires large development teams to move them forward.
Auditing of numbers, improved scheduling/queuing of database requests,
optimization of API calls and DB queries; all of this can be done by
individual contributors, making this suitable work for even chapters
with limited experience managing technical projects to take on.
On the analytics side, we're well aware that many users have asked for
better access to the pageview data, either through MariaDB, or through
a dedicated API. We have now said for some time that our focus is on
modernizing the infrastructure for log analysis and collection,
because the numbers collected by the old webstatscollector code were
incomplete, and the infrastructure subject to frequent packet loss
issues. In addition, our ability to meet additional requirements on
the basis of simple pageview aggregation code was inherently
constrained.
To this end, we have put into production use infrastructure to collect
and analyze site traffic using Kafka/Hadoop/Hive. At our scale, this
has been a tremendously complex infrastructure project which has
included custom development such as varnishkafka [6]. While it's taken
longer than we've wanted, this new infrastructure is being used to
generate a public page count dataset as of this month, including
article-level mobile traffic for the first time [7]. Using
Hadoop/Hive, we'll be able to compile many more specialized reports,
and this is only just beginning.
Giving community developers better access to this data needs to be
prioritized relative to other ongoing analytics work, including but
not limited to:
- Continued development and maintenance of the above infrastructure
foundations;
- Development of "Vital Signs": public reports on editor activity,
content contribution, sign-ups and other metrics. This tool gives us
more timely access to key measures than WikiStats [9] (or the
reportcard [10], which to-date still consumes Wikistats data). Rather
than having to wait 4-6 weeks to know what's happening with regard to
editor numbers, we can see continuous updates on a day-to-day basis.
- Development of Wikimetrics, which analyzes the editing activity of a
group of editors, and which is essential for measuring all movement
work that targets increased activity by a targeted group (e.g.
editathon), and is a key tool used for grants evaluation (was a funded
program worth the $$?). A lot of thought has gone into the development
of standardized global metrics [12] for program work, much of it
using this technology and dependent on its continued development.
- Measurement (instrumentation) of site actions and
development/maintenance of associated infrastructure. As an example,
in-depth data collection for features like Media Viewer (see
dashboards at [13] ) is only possible because of the EventLogging
extension developed by Ori Livneh, and the increasing use of this
technology by WMF developers. EventLogging requires significant
management, maintenance and teaching effort from the analytics team.
Lila is requesting visibility into all primary funnels on Wikimedia
sites (e.g. sign-ups, edits/saves through wikitext, edits/saves
through VisualEditor, etc.), and this will require lots of sustained
effort from lots of people to get done. What it will give us is a
better sense of where people succeed and fail to complete an action --
by way of example, see the initial UploadWizard funnel analysis here:
https://www.mediawiki.org/wiki/UploadWizard/Funnel_analysis
- Improved software and infrastructure support for A/B testing,
possibly including adoption of existing open source tooling such as
Facebook's PlanOut library/interpreter [14].
- Improved readership metrics, possibly including a privacy-sensitive
approach to estimating Unique Visitors, and better geographic
breakdowns for readers/editors.
These are all complex problems, most of which are dependent on the
small analytics team, and feedback on projects and priorities is very
much welcome on the analytics mailing list:
https://lists.wikimedia.org/mailman/listinfo/analytics
With regard to better embedding of graphs in wikis specifically, Yuri
Astrakhan has led the development of a new extension, inspired by work
by Dan Andreescu, to visualize data directly in wikis. This extension
has been deployed already to Meta and MediaWiki.org and can be used
for dynamic graphs where it's appropriate to not have a fallback to a
static image, for example in grant reports. See:
https://www.mediawiki.org/wiki/Extension:Graph
https://www.mediawiki.org/wiki/Extension:Graph/Demo
https://meta.wikimedia.org/wiki/Graph:User:Yurik_(WMF)/Obama
I agree this is the kind of functionality that should make its way
into Wikipedia. Again, we need to judge throwing a full team behind
that against the relative priority of other work. In the meantime,
Yuri and others will continue to push it along and may even be able to
get it all the way there in due time. The main blockers, from what I
can tell, are generation of static fallback images for users without
JavaScript, and a better way to manage the data sources.
In general, the point of my original message was this: All
organizations that seek to improve Wikipedia and the other Wikimedia
projects ultimately depend on technology to do so; to view WMF as the
sole "tech provider" does not scale. Larger, well-funded chapters can
take on big, hairy challenges like Wikidata; smaller, less-funded orgs
are better positioned to work on specialized technical support for
programmatic work.
I would caution against requesting WMF to work on highly specialized
solutions for highly specialized problems. If such solutions are
needed, I would caution against building them into MediaWiki unless
they can be generalized to benefit a larger number of users, at which
point it's appropriate to seek partnership with WMF, or to ask WMF for
the relative priority of such work. But often, it's perfectly fine
(and much faster) to build such tools and reports independently, and
to ask WMF for help in providing APIs/services/data/infrastructure to
get it done.
Cheers,
Erik
[1] http://tools.wmflabs.org/glamtools/
[2]
https://outreach.wikimedia.org/wiki/Category:This_Month_in_GLAM_Tool_testin…
[3] https://www.mediawiki.org/wiki/Extension:GWToolset
[4]
https://commons.wikimedia.org/w/index.php?title=Special%3AListUsers&usernam…
[5]
https://www.mediawiki.org/wiki/Special:OAuthListConsumers?name=&publisher=&…
[6] https://github.com/wikimedia/varnishkafka
[7] https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites
[8] https://metrics.wmflabs.org/static/public/dash/
[9] http://stats.wikimedia.org/
[10] http://reportcard.wmflabs.org/
[11] https://metrics.wmflabs.org/
[12]
https://meta.wikimedia.org/wiki/Grants:Learning_%26_Evaluation/Global_metri…
[13] http://multimedia-metrics.wmflabs.org/dashboards/mmv
[14] https://github.com/facebook/planout
--
Erik Möller
VP of Product & Strategy, Wikimedia Foundation
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l(a)lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
<mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
Hi all,
We set up some data collection a few weeks back to look at the distribution
of actual screen size, viewport size and media viewer canvas size on
sampled users. This will ultimately be used to come up with a better choice
of thumbnail size buckets for Media Viewer.
I had some spare time and figured I'd try to generate a visualization of
that data, which we haven't analyzed yet.
Here are the results:
https://upload.wikimedia.org/wikipedia/commons/8/82/Screen_heatmap.png
https://upload.wikimedia.org/wikipedia/commons/2/2f/Viewport_heatmap.png
https://upload.wikimedia.org/wikipedia/commons/d/d3/Canvas_heatmap.png
(Media Viewer-specific)
I think it shows quite strikingly how screen size really doesn't matter
much compared to the actual available viewport. Hopefully this data will be
useful for other folks too, since I don't believe we tracked that
information before. And given media viewer's traffic, it should be pretty
representative of wikis in general.
The data used to generate those images is all the data we've collected so
far. I haven't looked at differences between wikis, etc. For people with
analytics access, the EL table I dug that data from is
MultimediaViewerDimensions_10014238
Note that this is mostly desktop data, since it's very unusual to run
Media Viewer on mobile devices: it wasn't made for them, and the
mobile site has its own MV-like lightbox.
The code used to generate these images is this quick-and-dirty
Processing script I hacked together: https://phabricator.wikimedia.org/P39
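For anyone who wants to play with the raw table, the core binning step behind a heatmap like these can be sketched in a few lines of Python (an illustration of the idea, not the actual script; the 100-pixel cell size and the sample dimensions are arbitrary):

```python
from collections import Counter

# Bin sampled (width, height) pairs into 100x100-pixel cells and
# count how many samples land in each cell; the counts are what a
# heatmap colors. Cell size and samples are arbitrary choices here.
BUCKET = 100

def heatmap(dimensions):
    cells = Counter()
    for width, height in dimensions:
        cells[(width // BUCKET, height // BUCKET)] += 1
    return cells

samples = [(1366, 768), (1920, 1080), (1344, 742)]
print(heatmap(samples).most_common(1))  # the hottest cell and its count
```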