Hi,
since the work that happens around the Analytics Cluster and on the
Ops side of Analytics is not very visible, it was suggested that we
improve visibility with a weekly write-up.
Posting it to the public list for a start, but if this is too much noise
for you, please let us know.
In the week from 2014-08-18 to 2014-08-24, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* Hadoop worker memory limits now automatically configured
* Automatic data removal was prepared and activated for webrequest data
* Adjusting access to raw webrequest data
* Learning from data ingestion alarms
* Webstatscollector and kafka
* Distupgrade on stat1003
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
(details below)
Have fun,
Christian
* Hadoop worker memory limits now automatically configured
Previously, each worker had the same memory limit, regardless of the
resources the worker actually had. By now allowing different memory
limits on different workers, we can better utilize each worker's
resources.
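For illustration only, a rough sketch of the idea, not the actual
puppet code (the function name and the reserved amount are assumptions):

    # Sketch: derive a per-worker YARN memory limit from the worker's own RAM,
    # instead of using one hard-coded value for every worker.
    def yarn_nodemanager_memory_mb(total_ram_mb, reserved_for_os_mb=8192):
        # Leave some RAM for the OS and Hadoop daemons, hand the rest to YARN.
        return max(total_ram_mb - reserved_for_os_mb, 1024)

    print(yarn_nodemanager_memory_mb(65536))  # 64 GiB worker -> 57344 MB for containers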
* Automatic data removal was prepared and activated for webrequest data
Kraken's setup to remove raw webrequest data after a given number of
days (currently: 31) was brought over to refinery and turned on.
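Roughly speaking, the pruning boils down to something like the
following sketch (the HDFS base path and the daily partition layout
are assumptions on my part, not the actual refinery job):

    import subprocess
    from datetime import datetime, timedelta

    RETENTION_DAYS = 31                    # current retention from above
    BASE = '/wmf/data/raw/webrequest'      # hypothetical HDFS base path

    cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
    listing = subprocess.check_output(['hdfs', 'dfs', '-ls', BASE]).decode()
    for line in listing.splitlines():
        path = line.split()[-1]
        try:
            # assume daily partition directories named .../YYYY-MM-DD
            day = datetime.strptime(path.rsplit('/', 1)[-1], '%Y-%m-%d')
        except ValueError:
            continue
        if day < cutoff:
            subprocess.check_call(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path])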
* Adjusting access to raw webrequest data
In order to have proper privilege separation on the cluster, access
paths have been split into different groups.
* Learning from data ingestion alarms
With the new monitoring in place, we started to look at the alarms
and are trying to make sense of them. Monitoring seems to work fine:
the partitions that got flagged really did have issues. On the flip
side, the samples we checked that passed monitoring look valid
too. So monitoring seems effective in both directions.
Of the flagged partitions, most are due to races on varnish (Bug
69615). No log lines get lost or duplicated in such races.
There was one incident where a leader re-election caused a drop of
a few hundred log lines (Bug 69854). Leader re-election currently
may cause such hiccups, but there is already a theory about the real
root cause of such drops, and it should be fixable.
The only other issue was one hour this Saturday (Bug 69971). It
seems to affect only esams, but all four sources there. A real
investigation is still pending.
So the raw data that is flowing into the cluster is generally
good, and we're starting to iron out the glitches exposed by the
monitoring.
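For those curious how a partition gets flagged: the check essentially
compares, per source host, the number of lines received against the
range of the per-host sequence numbers. A minimal sketch of the idea
(not the actual refinery code; field names are assumptions):

    def flag_partition(records):
        # records: iterable of (hostname, sequence_number) pairs for one partition.
        per_host = {}
        for host, seq in records:
            lo, hi, count = per_host.get(host, (seq, seq, 0))
            per_host[host] = (min(lo, seq), max(hi, seq), count + 1)
        flagged = {}
        for host, (lo, hi, count) in per_host.items():
            expected = hi - lo + 1
            if count != expected:  # lines missing (count < expected) or duplicated
                flagged[host] = expected - count
        return flagged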
* Webstatscollector and kafka
We started working on making webstatscollector consume from
Kafka. It's a bit more involved than we hoped (burstiness of Kafka,
buffer receive errors, other processes blocking I/O, ...), but the
latest build and setup, which has been running since about midnight,
has worked without issues so far.
*Knocking on wood*
* Distupgrade on stat1003
stat1003 had its distribution upgraded. New shiny software for
researchers :-)
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
Packet loss was limited to two periods of a few minutes each. The
root cause of the issue was Bug 69661, which backfired.
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
A database connection got dropped, which made the aggregation fail
on 2014-08-19. The root cause of the connection drop is
unknown. Nothing noteworthy happened on the database server in use,
nor on stat1003 (the dist-upgrade coincidentally took place on the
same day, but happened later in the day). Since this happened for
the first time, we're writing it off as a fluke for now.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Yes, we could look at Google's infoboxes as doing us a favor because they
decrease the load on our servers. We would need to account for those views
in some way if we are interested in quantifying success in the sense of
total views of our content regardless of where it is reproduced.
However, I think Analytics said in a WMF Metrics Meeting presentation that
the number of Google search referrals was not going down enough to explain
the drop in pageviews. I'm copying this email to Analytics in the hope that
they'll comment about the probable causes of the pageview decreases.
Pine
On Sun, Aug 24, 2014 at 6:06 PM, MZMcBride <z(a)mzmcbride.com> wrote:
> Risker wrote:
> >Given the mission is sharing information, I'd suggest that if we have a
> >95% drop in readership, we're failing the mission. Donations are only a
> >means to an end.
>
> I think this assumes a direct correlation between pageviews and sharing
> information and I'm not sure such a direct correlation exists.
>
> When you do a Google search for "abraham lincoln", there's now an infobox
> on the search results page with content from Wikipedia. This could easily
> result in a drop in the number of Wikipedia pageviews, but does that mean
> that Wikipedia is failing its mission? The goal is a world in which we
> freely share in the sum of all human knowledge. If third parties are
> picking up and re-using our free content (and they are), I think we're
> certainly not losing. We may even be winning(!).
>
> We offer bulk-download options for our content, as well as the ability to
> directly query for article content on-demand via the MediaWiki API. Both
> of these access methods very likely result in 0 pageviews being
> registered (XML dump downloads and api.php hits aren't considered
> pageviews, as far as I'm aware), but we're directly sharing content.
>
> As a metric, pageviews are probably not very meaningful. One way we can
> observe whether we're fulfilling our mission is to see how ubiquitous
> our content has become. An even better metric might be the quality of the
> articles we have. Anecdotal evidence suggests that higher article quality
> is not really tied to the readership rate, though perhaps article size is.
>
> MZMcBride
>
>
>
> _______________________________________________
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> Wikimedia-l(a)lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
>
Sean:
Could you explain a little bit why the following bug affects EL data
going public (for the schemas that have public data and can be made
public more easily than others)?
https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
Thanks,
Nuria
Hi,
TL;DR: When consuming EventLogging data, only rely on the 'log'
database available from m2 replicas, like analytics-store.eqiad.wmnet.
Other representations might not get updated, might not get fix-ups or
may (on purpose) give you unvalidated data.
----------------------------------
Due to the versatile design of EventLogging, its data exists/existed
in many different representations, which left me confused about the
data quality expectations. I also could not find them publicly
documented. After talking about different aspects with a few people, I
wanted to put my current understanding up for public discussion.
Please let me know (either in private or on list) if something looks
wrong or does not match your use of EventLogging data.
* MySQL / MariaDB database on m2
This database is the best place to consume EventLogging data from.
Available as 'log' database on m2 replicas, such as
analytics-store.eqiad.wmnet.
Only validated events enter the database.
In case of bugs, this database is the only place that gets fixes like
cleanup of historic data, or live fixes.
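For example, a minimal sketch of consuming from a replica (assuming
pymysql and a working ~/.my.cnf; the schema table name below is purely
illustrative, not a real table):

    import os
    import pymysql

    conn = pymysql.connect(host='analytics-store.eqiad.wmnet', db='log',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    with conn.cursor() as cur:
        # 'SomeSchema_1234567' stands in for a real <Schema>_<revision> table.
        cur.execute("SELECT COUNT(*) FROM SomeSchema_1234567 WHERE timestamp >= %s",
                    ('20140818000000',))
        print(cur.fetchone()[0])
    conn.close()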
* 'all-events' JSON log files [1]
Use this data source only to debug issues around ingestion into the m2
database.
Entries are JSON objects.
Only validated events get written.
In case of bugs, historic data does not get fixed.
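A minimal sketch of reading one of these archived files (path pattern
from [1]; the date is just an example, and I assume one JSON object
per line as described above):

    import gzip
    import json

    path = '/a/eventlogging/archive/all-events.log-20140820.gz'
    counts = {}
    with gzip.open(path, 'rt') as f:
        for line in f:
            event = json.loads(line)
            schema = event.get('schema', 'unknown')
            counts[schema] = counts.get(schema, 0) + 1
    print(counts)  # events per schema for that day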
* Raw client and server side log files [2]
Use this data source only to debug issues around ingestion into the m2
database.
Entries are the parameters of the event.gif request. They are not
decoded at all.
In case of bugs, historic data does not get fixed, nor do hot-fixes
necessarily reach those files.
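If you do need to peek at these, the payload is URL-encoded JSON; a
rough sketch of decoding it (the exact surrounding line format in the
archives is not assumed here):

    import json
    from urllib.parse import unquote

    def decode_query(query_string):
        # query_string: the raw '?%7B...%7D;' parameter part of an event.gif request.
        payload = query_string.lstrip('?').rstrip(';')
        return json.loads(unquote(payload))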
* Kafka:
EventLogging data is no longer fed into Kafka since 2014-06-12 [3].
The EventLogging data in Kafka had no users.
Turning it on again is tracked in bug 66528 [4].
* MongoDB:
EventLogging data is no longer fed into MongoDB since 2014-02-13 [5].
The EventLogging data in MongoDB did not appear to get used.
I am not aware of plans to revive feeding the data into MongoDB.
* ZMQ:
ZMQ is available from vanadium.
In case of bugs, historic data cannot get fixed :-)
Data coming from the forwarders (ports 8421, 8422) is not validated
and does not necessarily see hot-fixes.
Data coming from the processors (ports 8521, 8522) and the
multiplexer (port 8600) is validated.
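A minimal subscriber sketch (assuming pyzmq and network access to
vanadium; the fully qualified host name is an assumption):

    import zmq

    context = zmq.Context()
    sock = context.socket(zmq.SUB)
    sock.connect('tcp://vanadium.eqiad.wmnet:8600')  # multiplexer port from above
    sock.setsockopt_string(zmq.SUBSCRIBE, '')        # subscribe to everything
    while True:
        print(sock.recv_string())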
Have fun,
Christian
[1] Available as
stats1002:/a/eventlogging/archive/all-events.log-$DATE.gz
stats1003:/srv/eventlogging/archive/all-events.log-$DATE.gz
vanadium:/var/log/eventlogging/...
[2] Available as
stats1002:/a/eventlogging/archive/client-side-events.log-$DATE.gz
stats1002:/a/eventlogging/archive/server-side-events.log-$DATE.gz
stats1003:/srv/eventlogging/archive/client-side-events.log-$DATE.gz
stats1003:/srv/eventlogging/archive/server-side-events.log-$DATE.gz
vanadium:/var/log/eventlogging/...
[3] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/f85b1dbcd61bbb…
[4] https://bugzilla.wikimedia.org/show_bug.cgi?id=66528
[5] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/05b4027973c59b…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hey Dan G and Analytics team,
I wanted to continue and finish the discussion that happened during the
Analytics showcase earlier today.
We're implementing a new feature in Wikimetrics where you can upload a
cohort and check a box so that every user's accounts on other wikis
(projects) will be added to the cohort (using CentralAuth). The purpose is
to see if editors are active on other projects.
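For reference, the expansion roughly amounts to a lookup like the
following sketch (the host name, and the assumption that CentralAuth's
'localuser' table is reachable this way, are mine, not the actual
Wikimetrics code):

    import os
    import pymysql

    # Sketch: find all wikis on which CentralAuth has a local account
    # attached to the given global user name.
    conn = pymysql.connect(host='analytics-store.eqiad.wmnet', db='centralauth',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    with conn.cursor() as cur:
        cur.execute("SELECT lu_wiki FROM localuser WHERE lu_name = %s",
                    ('ExampleUser',))
        print([wiki for (wiki,) in cur.fetchall()])
    conn.close()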
The research scientists pointed out that there are issues with CentralAuth
and they are showing up in EventLogging (
https://bugzilla.wikimedia.org/show_bug.cgi?id=66101 ).
Let me try to sum up the issue here:
Suppose someone has an unattached account. She then goes to an
editathon and volunteers her name to be included in a cohort. The
resulting cohort, when expanded with CentralAuth, would include users
from other wikis.
Dan pointed out that it would be extremely unlikely for a cohort
expanded using CentralAuth to include unattached users.
I'm inclined not to worry about the issue and move ahead with
releasing the feature.
Please discuss if I'm missing something.
The next Research & Data showcase
<https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase> will
be live-streamed this Wednesday, 8/20 at 11:30 PT.
The streaming link will be posted on the lists a few minutes before the
showcase starts and as usual, you can join the conversation on IRC at
#wikimedia-research.
We look forward to seeing you!
Leila
This month:
*Everything You Know About Mobile Is Wr^WRight: Editing and Reading Pattern
Variation Between User Types*
By *Oliver Keyes*: Using new geolocation tools, we look at reader and
editor behaviour to understand how and when people access and contribute to
our content. This is largely exploratory research, but has potential
implications for our A/B testing and how we understand both cultural
divides between reader and editor groups from different countries, and how
we understand the differences between types of edit and the editors who
make them.
*Wikipedia article curation: understanding quality, recommending tasks*
By *Morten Warncke-Wang*: In this talk we look at article curation in
Wikipedia through the lens of task suggestions and article quality. The
first part of the talk presents SuggestBot, the Wikipedia article
recommender. SuggestBot connects contributors with articles similar to
those they previously edited. In the second part of the talk, we discuss
Wikipedia article quality using “actionable” features, features that
contributors can easily act upon to improve article quality. We will first
discuss these features’ ability to predict article quality, before coming
back to SuggestBot and show how these predictions and actionable features
can be used to improve the suggestions.
*Bio: Morten Warncke-Wang is a PhD student at the GroupLens research lab,
University of Minnesota. His main research focus is artefact quality and
task recommendations in peer production communities. On the task
recommendation side he has maintained the Wikipedia article recommender
SuggestBot (http://en.wikipedia.org/wiki/User:SuggestBot) since 2010,
expanding it to support six languages and additional information about
recommended articles. His work on artefact quality looks at understanding
quality through features contributors can easily improve, using them to
both predict Wikipedia article quality and suggest improvement tasks to
Wikipedia contributors.
You can find more information about his research on his homepage:
http://www-users.cs.umn.edu/~morten/
Hello,
Ehsan Shahghasemi, a PhD candidate in Communication, is doing
research for his dissertation on the cross-cultural schemata Americans
have of another nation. I would appreciate it if you could kindly
help him by answering his questionnaire. It doesn't take more than 4
minutes:
https://docs.google.com/forms/d/1jnbxpxZdsUkJ7237bSL3daRBDNGEkqT1s8kqUobzAG…
Thanks in advance
Hi,
the dev team has committed to the following user stories for the sprint
starting today, ending August 19.
Bug ID  Component    Summary                                                                    Points
68731   Wikimetrics  Backing up wikimetrics data fails if data is written while we back it up   5
68833   Wikimetrics  session management                                                         21
68840   EEVS         Wikimetrics can't run a lot of recurrent reports at the same time          8
67806   Wikimetrics  Story: EEVSUser loads static site in accordance to Pau's design            13
68507   Wikimetrics  replication lag may affect recurrent reports                               8

Total Points: 55
You can see the sprint here:
http://sb.wmflabs.org/t/analytics-developers/2014-08-07/
Cheers,
Kevin Leduc
Hello Everyone
I want to work on a project from the project list
<https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects>:
"Wikimedia Performance Portal
<https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Wikime…>".
It aims to present data graphically so that it represents
performance metrics about the Wikimedia cluster, and also to organize
the data so that important data doesn't get mixed with unimportant
data.
To work on it, I need access to the data, or at least some glimpse of
it and its annotations/descriptions. Where can I access them?
I am new to the FOSS world and want to work on this project because it
is related to data analytics, which has always attracted me. I am not
proficient in data analysis yet, but I want to be; working on this
project will give me good experience that leads toward that goal.
I have a good hand in Python and Java, know the basics of R, PHP, C,
C++, and JavaScript, and am also willing to learn whatever else is
needed.
I mailed the project's listed mentor about this, but unfortunately did
not get a response, probably because he is busy. So could I please
have some guidance on where to start with this project and on what it
is all about?
Thanks!!!
Shaifali Agrawal
about.me/shaifaliagrawal