Thank you Erik. I see that the squid browser reports for the months of August and September of this year now have country data. However, the data between the months of October 2013 and July 2014 remain unchanged. Also, with those data I notice that my browser reports that it was unable to get jquery-1.3.2.min.js and jquery.tablesorter.js.
Atul
Thank you Erik. I look forward to seeing those reports in their corrected form in due course. The fact that Wikimedia reports contain actual visitor numbers rather than mere percentages makes them a really rich source of information that can be analyzed in a host of different ways - but not without the country names!
Atul Vaidya
Hi :-)
These are the largest Eventlogging tables on m2-master:
145G MobileWebClickTracking_5929948.ibd
94G PageContentSaveComplete_5588433.ibd
61G MediaViewer_8572637.ibd
57G MediaViewer_8245578.ibd
30G MultimediaViewerNetworkPerformance_7917896.ibd
29G MediaViewer_8935662.ibd
24G MobileWikiAppToCInteraction_8461467.ibd
Are these sizes roughly expected?
Anything we can discard or reduce?
Where did the discussion on purging data end up?
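To help judge whether these sizes are expected, the listing can be totalled per schema. The following is only a rough sketch: it assumes each file name has the form <schema>_<revision>.ibd, and the function name size_by_schema is made up for illustration.

```python
import re
from collections import defaultdict

# Listing copied from the message above (size in GB, then .ibd file name).
LISTING = """\
145G MobileWebClickTracking_5929948.ibd
94G PageContentSaveComplete_5588433.ibd
61G MediaViewer_8572637.ibd
57G MediaViewer_8245578.ibd
30G MultimediaViewerNetworkPerformance_7917896.ibd
29G MediaViewer_8935662.ibd
24G MobileWikiAppToCInteraction_8461467.ibd
"""

# Assumption: names look like <schema>_<revision>.ibd and sizes are whole GB.
LINE_RE = re.compile(r"^(\d+)G\s+(\w+)_(\d+)\.ibd$")


def size_by_schema(listing):
    """Sum the reported sizes (in GB) over all revisions of each schema."""
    totals = defaultdict(int)
    for line in listing.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed pattern
        size_gb, schema, _revision = match.groups()
        totals[schema] += int(size_gb)
    return dict(totals)


totals = size_by_schema(LISTING)
for schema, size_gb in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{size_gb:>4}G {schema}")
```

Grouped this way, the three MediaViewer revisions together are slightly larger than the single biggest table.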
No immediate problems here, just rattling cages :-)
BR
/s
--
DBA @ WMF
Hi,
One of the longstanding issues with Webstatscollector is that it
counts redirects at the HTTP level.
So for example [1]:
- Requesting a page with a lower case first letter,
- Requesting a page from the desktop site on a mobile device, or
- Requesting www.wikipedia.org (the first part is www, not a language code)
causes two requests to the caches, and webstatscollector counts both,
although actually only a single page is shown to the user.
As a result, the reported numbers are too high.
Since we're about to deploy a fix for webstatscollector tomorrow
anyway, and this double counting should not be too hard to fix, let's
get it fixed too.
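To illustrate the idea (this is not the webstatscollector code, just a minimal sketch): if each logged request carries its HTTP status, redirect responses can be skipped so that only the request that actually serves the page is counted. The (url, status) record shape and the name count_pageviews are assumptions made for this example.

```python
from collections import Counter

# HTTP status codes that signal a redirect rather than a served page.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def count_pageviews(requests):
    """Count requests per URL, skipping HTTP-level redirects.

    `requests` is an iterable of (url, status) pairs; this record shape
    is hypothetical and only serves the illustration.
    """
    counts = Counter()
    for url, status in requests:
        if status in REDIRECT_STATUSES:
            continue  # the follow-up request will be counted instead
        counts[url] += 1
    return counts


# A lower-case first letter causes a redirect followed by the real page.
sample = [
    ("http://en.wikipedia.org/wiki/foo", 301),
    ("http://en.wikipedia.org/wiki/Foo", 200),
]
print(count_pageviews(sample))
```

With the redirect skipped, the two cache requests in the sample yield a single counted page view.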
If you see value in counting redirects, please let us know soonish, as
I'll try to get the changes into tomorrow's deployment.
Sorry for the short notice,
Christian
[1] See the corresponding bug 71790
https://bugzilla.wikimedia.org/show_bug.cgi?id=71790
for concrete examples.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
people from gerrit's “Analytics” group [1] currently hold
* Push (including Force Push)
* Push Merge Commit
* Forge Author Identity
* Forge Committer Identity
permissions on “analytics/*” projects in gerrit. But those permissions
have gotten in the way, and keep doing so, one way or another.
Do we need those permissions for our repos?
If no one objects, I'll start removing them on 2014-04-28.
Best regards,
Christian
[1] https://gerrit.wikimedia.org/r/#/admin/groups/uuid-d34747bee94be39cff54b5fd…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
the machine that hosts
stats.wikimedia.org
datasets.wikimedia.org
is experiencing problems, and hence the above sites are currently
unavailable.
Investigation is still going on.
We're tracking the issue at
https://bugzilla.wikimedia.org/show_bug.cgi?id=71686
Sorry for the inconvenience,
Christian
Hi,
in the week from 2014-09-22–2014-09-28 Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* Accessing HDFS through plain file system
* Bringing Webstatscollector to Hive
* Presentation of Sqoop
* Using kafkatee to generate TSVs
* TSV generation through Hive
* Packetloss alerts (Bug 71116)
(details below)
Have fun,
Christian
* Accessing HDFS through plain file system
As a by-product of preparing to get cluster-generated datasets to the
webservers, HDFS got mounted (read-only) on stat1002 into the plain
file system at /mnt/hdfs.
So you can now for example access the HDFS data files directly from
/mnt/hdfs/wmf/data
on stat1002.
Also, you no longer need to set up SSH tunnels and the like to get to
your logs. You can now just look at them from
/mnt/hdfs/var/log/hadoop-yarn/apps/
as plain files, and grep, tail, ... them.
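As a rough sketch of what this enables, the mounted tree can be searched with ordinary file APIs. The helper name grep_tree is made up for illustration, and the sketch assumes the files can be read as text (undecodable bytes are replaced rather than raising).

```python
import os


def grep_tree(root, needle):
    """Yield (path, line_number, line) for lines under root containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps going on bytes that are not valid text.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file; skip it


# Example use on stat1002 (path taken from the text above):
# for path, number, line in grep_tree("/mnt/hdfs/var/log/hadoop-yarn/apps/", "ERROR"):
#     print(f"{path}:{number}: {line}")
```

Since the mount is read-only, a scan like this cannot modify anything on HDFS.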
* Bringing Webstatscollector to Hive
The webstatscollector reimplementation in Hive got merged and has been
producing data since 2014-09-23. This implementation
** is no longer subject to the continuous packet loss on udp2log [1],
** can rerun jobs if needed,
** contains pagecounts for all sites.
While researchers could already use the data on stat1002, legal
sign-off for publishing it to the public was still missing.
(We got it in the meantime, so publishing is imminent. But that will
be reported in the next weekly email)
* Presentation of Sqoop
More research around how to get MediaWiki databases into Hadoop was
done, and Sqoop is the tool of choice at this point in time.
The current possibilities of Sqoop and how one can use it to import
data into Hadoop have been demoed and discussed with researchers.
* Using kafkatee to generate TSVs
Discussions around kafkatee are still going on. But there is no
solution yet.
* TSV generation through Hive
Since kafkatee issues are not yet resolved, we followed up on previous
week's initial screening by doing a more thorough check on the
feasibility of generating the TSVs through Hive instead of kafkatee.
We can cover not only the immediately needed TSVs for our researchers,
but also the GLAM and fundraising TSVs. So we could do without
kafkatee entirely.
Looking more closely at the implementation blockers, it seems there is
only a geocoding blocker. But we can overcome it with a little Java
coding.
We gave it a shot with the sampled-1000, mobile-sampled-100, and zero
tsvs, vetted the Hive-produced data and it worked smoothly.
So the way forward is solid and paved, in case kafkatee issues cannot
get resolved soonish.
* Packetloss alerts (Bug 71116)
On 2014-09-20 there were alerts around packet loss on udp2log. But
they turned out to be an artifact of a ULSFO outage [2].
[1] This is the first dataset that's coming out of the cluster, and
the cluster still has minor hiccups from time to time. But data
quality and reliability are already several orders of magnitude better
than udp2log has ever been. So it's already the preferred source for
datasets.
[2] https://lists.wikimedia.org/mailman/private/ops/2014-September/040429.html
Hi,
The fruits of our labor on Editor Engagement Vital Signs (EEVS) are on
display. This is still an early release; we have a backlog of feedback
from internal stakeholders and more iterations are to come.
https://metrics.wmflabs.org/static/public/dash/
This sprint’s commitments are:
Bug ID | Component     | Points | Summary
-------+---------------+--------+--------------------------------------------
69569  | Wikimetrics   | 13     | Story: WikimetricsUser runs 'Rolling Recurring old active editors' report
67806  | Visualization | 13     | Story: EEVSUser loads static site in accordance to Pau's design
71009  | Wikimetrics   | 5      | Update 'existing' Pages Created to include deleted pages
71008  | Wikimetrics   | 5      | Update 'existing' Edits Metric to include deleted pages
70887  | Dashiki       | 21     | Story: Bookmarks / Stateful URL. Define protocol and use it to bootstrap the dashboard and keep state
That’s 55 Points in 5 stories
Our progress is tracked in scrumbugs:
http://sb.wmflabs.org/t/analytics-developers/2014-09-18/
cheers,
Kevin Leduc