Hi all!
For a while now, we’ve been hosting some public datasets at http://stat1001.wikimedia.org/public-datasets. We wanted to dissociate the domain these datasets are hosted at from the actual server name, so we did! The same data is now available at http://datasets.wikimedia.org. Redirects from stat1001.wikimedia.org are in place.
Let us know if you have any trouble.
Thanks!
-Ao
Hi,
The analytics dev team has committed to the following user stories for the
sprint starting today and ending September 2.
Bug ID  Component    Summary                                                                Points
69297   Wikimetrics  Story: EEVS user does not see reports for projects without databases   3
68351   EEVS         Story: AnalyticsEng has website for EEVS                               34
67806   EEVS         Story: EEVSUser loads static site in accordance to Pau's design        13
That’s 50 points in 3 Stories
You can see the sprint here:
http://sb.wmflabs.org/t/analytics-developers/2014-08-21/
Note:
Bug 68507 (replication lag may affect recurrent reports) is carried over
from the previous sprint and will be completed shortly.
Cheers,
Kevin Leduc
Hi,
In the week of 2014-08-25 to 2014-08-31, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Analytics cluster feeding more logs into logstash
* More buffer for kafka brokers
* Life support for webstatscollector on udp2log
* Webstatscollector and kafka
* Webstatscollector counting https requests from ulsfo twice
(details below)
Have fun,
Christian
* Analytics cluster feeding more logs into logstash
The analytics cluster previously fed only the worker nodes' logs into
logstash; now it feeds the namenode logs in as well.
* More buffer for kafka brokers
During partition leader re-elections, kafka brokers sometimes drop
a few log lines. Since the buffers could not hold messages for as long
as a re-election might take, the buffer size was increased. This
should let the brokers ride out a partition leader re-election without
dropping messages.
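As a purely illustrative sketch of the principle (buffer for longer
than a re-election can take, and retry in the meantime), here is what
such a setting looks like in a librdkafka-based Python producer; the
broker list, topic, and values are assumptions, not our actual
configuration:

  from confluent_kafka import Producer

  # Illustrative only: buffer client-side for longer than a leader
  # re-election can take, and retry sends that fail in the meantime.
  # Broker list, topic, and values are made up for this sketch.
  producer = Producer({
      'bootstrap.servers': 'broker1:9092,broker2:9092',
      'queue.buffering.max.ms': 300000,        # hold messages up to 5 minutes
      'queue.buffering.max.messages': 1000000,
      'message.send.max.retries': 10,          # ride out the re-election
      'retry.backoff.ms': 500,
  })

  producer.produce('webrequest', value=b'<one log line>')
  producer.flush()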
* Life support for webstatscollector on udp2log
The production webstatscollector (the software that produces the
hourly pageview files used, for example, by stats.wikimedia.org and
stats.grok.se) that consumes from udp2log started to produce faulty
files. A no-longer-needed service on the host that runs part of
webstatscollector was hogging resources, so it was stopped to free
them up. Strangely enough, those additional resources made
webstatscollector misbehave even more: the disks could no longer
handle the load. After switching the service to write to a RAM disk,
the host could handle the write load again. This switch not only
brought webstatscollector back to life, it also decreased packet loss
on the collector by a bit more than an order of magnitude.
* Webstatscollector and kafka
Last week we reported that we had spun up a webstatscollector instance
that consumes from kafka instead of udp2log, and that the setup caused
some issues at first. We have now monitored the “webstatscollector on
kafka” setup for a week, and it produced the data extremely reliably.
So with this webstatscollector on kafka, we have a good baseline to
compare against when trying to scale webstatscollector up to Hadoop.
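For readers unfamiliar with the setup, this minimal sketch shows the
idea of consuming the request stream from kafka rather than udp2log;
the topic, group id, brokers, and field layout are assumptions, and
the production collector is C code, not this Python:

  from collections import Counter
  from confluent_kafka import Consumer

  # Minimal sketch of "webstatscollector on kafka": read request log
  # lines from a kafka topic instead of udp2log and feed them to the
  # same counting logic. All names below are hypothetical.
  consumer = Consumer({
      'bootstrap.servers': 'broker1:9092',
      'group.id': 'webstatscollector-sketch',
      'auto.offset.reset': 'earliest',
  })
  consumer.subscribe(['webrequest'])

  counts = Counter()
  try:
      while True:
          msg = consumer.poll(1.0)
          if msg is None or msg.error():
              continue
          fields = msg.value().decode('utf-8', 'replace').split(' ')
          if len(fields) > 8:          # crude guard against malformed lines
              counts[fields[8]] += 1   # assume field 8 is the request URL
  finally:
      consumer.close()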
* Webstatscollector counting https requests from ulsfo twice
While working on establishing the “webstatscollector on kafka”
baseline, we discovered that the udp2log webstatscollector counts
https requests from ulsfo twice. The corresponding fix was merged the
same day, but due to “no deploys on Fridays” the deploy did not happen
last week. (It has since been deployed, and the numbers look good.)
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Christian Aistleitner
Kefermarkterstraße 6a/3         Email: christian(a)quelltextlich.at
4293 Gutau, Austria             Phone: +43 7946 / 20 5 81
                                Fax:   +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Companies' registry: 360296y in Linz
---------------------------------------------------------------
Hi,
There has been another issue around webstatscollector, the software
that produces the raw numbers for stats.wikimedia.org and
stats.grok.se.
Due to a deployment regression, requests to Special:CentralAutoLogin/*
have been counted since 2014-07-07, although they should not have
been.
While this does /not/ impact the per-page numbers [1], it results in
overreporting of the per-wiki numbers. The size of the impact strongly
depends on the wiki (e.g. jawiki: ~0.5%, ruwiki: ~2%).
The regression has since been fixed; the corresponding bug is
https://bugzilla.wikimedia.org/show_bug.cgi?id=70295
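To make the shape of the fix concrete, here is a minimal, purely
illustrative sketch of such a filter; webstatscollector's actual
filter is C code, and the URL pattern here is an assumption:

  import re

  # Sketch of the filter idea: drop Special:CentralAutoLogin/* requests
  # before they reach the per-wiki counters. The real fix lives in
  # webstatscollector's C filter; this is illustrative only.
  CENTRAL_AUTO_LOGIN = re.compile(r'/wiki/Special:CentralAutoLogin/')

  def should_count(url):
      """Return True if a request should contribute to view counts."""
      return CENTRAL_AUTO_LOGIN.search(url) is None

  assert should_count('https://ja.wikipedia.org/wiki/Main_Page')
  assert not should_count(
      'https://ja.wikipedia.org/wiki/Special:CentralAutoLogin/start')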
Best regards,
Christian
[1] So pages like
http://stats.grok.se/en/latest30/Main_Page
are not affected by this bug.
Hello!
I've been working for the last few days on
https://github.com/Ironholds/WPDMZ, which currently generates raw data
on 'number of non-bot edits per country', and I'd like to run some
stats / make some graphs based on it. Since I'd like all my
'research' to be completely repeatable, I'd love it if we could make
the 'raw data' (edits per country) publicly available on labsdb. I
have most of the code written for it, *but* it needs anonymization.
The biggest de-anonymization threat involves identifying which editors
come from which countries. It can be executed in the following case:
an editor is the only person editing from a country in a project where
that country has low edit volume; by a process of elimination and by
counting edits from a public source (like recentchanges), the
individual editor can then be connected to a particular country.
I propose the following anonymization scheme:
1. No data for projects with fewer than a threshold of total
*individual editors* in the time period for which the data is
released.
2. For countries that have less than a threshold % of 'individual
editors' in the time period, we simply lump them in as 'other'.
This removes most of the de-anonymization attacks I can think of (a
rough sketch of the logic is below). Thoughts? I can easily write up
the code to generate these on a monthly basis and puppetize it to make
the data publicly available. I think not just me but lots of external
researchers would benefit from such data.
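To make the proposal concrete, here is a rough sketch of the two
rules; the threshold values and the input data shape are placeholders
I made up, not part of the proposal:

  # Rough sketch of the proposed two-rule anonymization, applied to a
  # mapping like {project: {country: individual_editor_count}}.
  # The thresholds are placeholders, not agreed-upon values.
  MIN_EDITORS_PER_PROJECT = 50  # rule 1: suppress small projects entirely
  MIN_COUNTRY_SHARE = 0.01      # rule 2: lump countries under 1% into 'other'

  def anonymize(editors_by_project_country):
      result = {}
      for project, by_country in editors_by_project_country.items():
          total = sum(by_country.values())
          if total < MIN_EDITORS_PER_PROJECT:
              continue                  # rule 1: release no data at all
          anonymized = {}
          for country, editors in by_country.items():
              if editors / total < MIN_COUNTRY_SHARE:
                  # rule 2: fold low-share countries into 'other'
                  anonymized['other'] = anonymized.get('other', 0) + editors
              else:
                  anonymized[country] = editors
          result[project] = anonymized
      return result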
Thanks!
--
Yuvi Panda
http://yuvi.in/blog