Hey folks,
As discussions on the new page view definition have been calming down,
we're preparing to deliver a draft version to the Devs. I want to make
sure that we all know the status and that any substantial concerns are
raised before we hand things off on *Friday, Dec 12th.*
For this phase, we are delivering the general filter[1]. This is the
highest level filter, and exists primarily to distinguish requests worthy
of further evaluation. Our plan is to take the definition as it exists on
the 12th, and begin generating high-level aggregate numbers based on it. In
future iterations, we will be digging into different breakdowns of this
metric, and iterating on it to handle any inconsistencies or unexpected
results. There are a few differences from Web Stat Collector's (WSC) version
of the general filter that we want to call to your attention:
- We include searches -- WSC explicitly excludes them.
- We include Apps traffic -- WSC does not detect Apps traffic.
- We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC
hardcodes "/wiki/"
- We don't include Banner impressions -- WSC includes them.
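To make the four differences above concrete, here is a rough Python sketch of a filter that behaves this way. It is not the actual definition (see [1] for that). The function name, the regular expressions, the "WikipediaApp" user-agent marker, the "action=mobileview" API call, and the "Special:BannerRandom" banner endpoint are all assumptions picked for illustration.

```python
import re
from urllib.parse import urlsplit

# Everything below is an illustrative assumption, not the real definition.
# "/wiki/" plus language-variant prefixes such as /zh-tw/, /zh-cn/, /sr-ec/.
ARTICLE_PATH = re.compile(r"^/(wiki|[a-z]{2,3}-[a-z]{2,4})/.+")
SEARCH_PARAM = re.compile(r"(^|&)search=")
APP_UA_MARKER = "WikipediaApp"          # assumed marker for Apps traffic
APP_API_ACTION = "action=mobileview"    # assumed API call made by the Apps
BANNER_TITLE = "Special:BannerRandom"   # assumed banner impression endpoint


def is_general_pageview(url: str, user_agent: str = "") -> bool:
    """Return True if the request passes this sketch of the general filter."""
    parts = urlsplit(url)
    # Banner impressions are excluded (WSC includes them).
    if BANNER_TITLE in parts.path or BANNER_TITLE in parts.query:
        return False
    # Apps traffic is included (WSC does not detect it).
    if APP_UA_MARKER in user_agent and APP_API_ACTION in parts.query:
        return True
    # Searches are included (WSC explicitly excludes them).
    if SEARCH_PARAM.search(parts.query):
        return True
    # /wiki/ and its variants are included (WSC hardcodes "/wiki/").
    return bool(ARTICLE_PATH.match(parts.path))
```

For example, is_general_pageview("https://zh.wikipedia.org/zh-tw/Foo") passes under these assumptions, while a request for Special:BannerRandom does not.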
There are also some known issues with the new definition that are worth
your notice:
1. *Internal traffic is counted*
- Note that WSC filters some internal traffic by hardcoding a set of IPs
in the definition. We are working on parsing puppet templates in order to
automatically detect which IPs represent internal traffic. This will be a
/better/ solution, but it's not quite ready yet because parsing puppet is
hard.
2. *Spider traffic is counted*
- We will be using the User-agent field to detect and flag spider-based
traffic. This "tag definition" will be delivered in a subsequent
definition. This actually matches WSC, which does not filter spiders for
the high-level metrics.
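As a rough illustration of how the two flags above might eventually look, here is a small Python sketch that tags a request rather than dropping it. The network ranges are reserved documentation placeholders, not real internal IPs, and the user-agent pattern is a deliberately naive assumption. The real list of IPs is meant to come from the puppet templates, and the real spider "tag definition" is still to be written.

```python
import ipaddress
import re

# Placeholder ranges (reserved for documentation); NOT the real internal IPs.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

# Naive, assumed pattern; the actual tag definition is not yet delivered.
SPIDER_UA = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def tag_request(ip: str, user_agent: str) -> dict:
    """Flag a request as internal and/or spider traffic without filtering it."""
    addr = ipaddress.ip_address(ip)
    return {
        # Membership is False when the address and network versions differ.
        "is_internal": any(addr in net for net in INTERNAL_NETWORKS),
        "is_spider": bool(SPIDER_UA.search(user_agent or "")),
    }
```

Tagging instead of filtering keeps the high-level numbers comparable with WSC while still allowing breakdowns later.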
These are problems we're aware of, and will be factoring in as we go
forward with our next task: refining the definition using real,
hourly-level traffic data. Thanks to everyone who has given feedback and
participated in the process thus far, particularly Nemo, Erik, and
Christian.
1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
-Aaron & Oliver
I’m glad to announce the release of an open-licensed corpus with 1.5M records from the Article Feedback v5 pilot.
http://dx.doi.org/10.6084/m9.figshare.1277784
Thanks to everyone who helped make this happen, Fabrice in particular for shepherding this through.
Dario
—
This dataset contains the entire corpus of feedback submitted on the English, French and German Wikipedia during the Article Feedback v.5 pilot (AFT). [1] The Wikimedia Foundation ran the Article Feedback pilot for a year between March 2013 and March 2014. During the pilot, 1,549,842 feedback messages were collected across the three languages.
All feedback messages and their metadata (as described in this schema [2]) are available in this dataset, with the exception of messages that had been oversighted and/or deleted by the end of the pilot.
The corpus is released [3] under the following license:
• CC BY SA 3.0 for feedback messages
• CC0 for the associated metadata
Results from the pilot are discussed in: Halfaker, A., Keyes, O. and Taraborelli, D. (2013). Making peripheral participation legitimate: Reader engagement experiments in Wikipedia. CSCW ’13 Proceedings of the 2013 Conference on Computer Supported Cooperative Work [4][5]
[1] https://www.mediawiki.org/wiki/Article_feedback/Version_5
[2] https://www.mediawiki.org/wiki/Article_feedback/Version_5/Technical_Design_…
[3] https://wikimediafoundation.org/wiki/Feedback_data#Article_Feedback
[4] http://dx.doi.org/10.1145/2441776.2441872
[5] http://nitens.org/docs/cscw13.pdf
Over the last few days, analytics-store replication has started to lag by
some hours. Currently s1 (enwiki) and s5 (dewiki, wikidatawiki) are most
affected. Eventlogging is not lagging, due to the nicely batched writes it
does nowadays :-)
There are many slow queries running from the research user on stat1003,
referencing eventlogging tables like MobileWebDiffClickTracking*.
I'm not sure who they belong to, or if they're new, or if they're safe to
kill, so this is mainly a heads-up email. Let us know if ops should kill
stuff to let the box catch up again.
BR
Sean
--
DBA @ WMF
(sending to public list)
I have started a doc on Wikitech that describes an Oozie 101 example and
goes a little into how to troubleshoot Oozie jobs. Still WIP.
Will update as work progresses:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie
Please edit/correct as needed.
Hi all!
Ops would like us to move the stat* boxes inside of the analytics VLAN. I just need to pick a date for this to happen.
I’m not entirely sure how long this will all take, so I’d like to schedule an entire day for these to potentially be offline. How about Thursday December 18th? If there are objections, I can find another day.
Thanks!
-Andrew Otto
Hi,
it seems some tables of the EventLogging database are no longer
replicating since around 2014-12-09T11:32 [1] on both dbstore1002 and
db1047 (s1-analytics-slave).
Strangely enough, it seems to affect only some tables. For example,
* NavigationTiming_10374055, and
* SaveTiming_10077760
are lagging, while
* Echo_7731316
* MobileWebEditing_8599025
are up-to-date. I filed RT #9016 to track the issue
https://rt.wikimedia.org/Ticket/Display.html?id=9016
Best regards,
Christian
[1] The m2 master database is showing current events, so this is
merely a replication issue.
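For anyone who wants to check other tables the same way, here is a small Python sketch of the comparison: given the newest event timestamp per table on the master and on a replica, it reports which tables are behind. The function name, the one-hour threshold, and the 14-digit timestamp format are assumptions, and the sample values in the usage note are made up for illustration, not measurements.

```python
from datetime import datetime, timedelta

# Assumed timestamp format (14 digits, e.g. 20141209113200).
TS_FORMAT = "%Y%m%d%H%M%S"


def lagging_tables(master_latest, replica_latest, threshold=timedelta(hours=1)):
    """Return {table: lag} for tables whose replica is behind by more than threshold.

    Both arguments map table name -> newest event timestamp as a string.
    Tables missing on the replica side are skipped.
    """
    lagging = {}
    for table, master_ts in master_latest.items():
        replica_ts = replica_latest.get(table)
        if replica_ts is None:
            continue
        lag = (datetime.strptime(master_ts, TS_FORMAT)
               - datetime.strptime(replica_ts, TS_FORMAT))
        if lag > threshold:
            lagging[table] = lag
    return lagging
```

With made-up inputs such as master = {"NavigationTiming_10374055": "20141210120000", "Echo_7731316": "20141210120000"} and replica = {"NavigationTiming_10374055": "20141209113200", "Echo_7731316": "20141210115900"}, only NavigationTiming_10374055 would be reported.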
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------