We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
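The first two use cases above can be sketched in a few lines of Python. The rows and the (referer, article, count) field layout below are illustrative stand-ins, not the dataset's actual column order or headers:

```python
from collections import defaultdict

# Hypothetical sample of (referer, article, count) rows in the spirit of
# the clickstream dataset; these values are made up for illustration.
rows = [
    ("other-google", "London", 1000),
    ("Main_Page", "London", 300),
    ("United_Kingdom", "London", 250),
    ("London", "United_Kingdom", 180),
]

def top_referers(rows, article):
    """Most common referers that led readers to a given article."""
    counts = defaultdict(int)
    for referer, art, n in rows:
        if art == article:
            counts[referer] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])

print(top_referers(rows, "London"))
```

The same grouping, run with articles as keys and referers as values, gives the most frequent links clicked *from* a given article.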
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi Analytics,
On ENWP, does the number of 26,163,773 users include IPs who have made
edits? Does it include editors on all Wikimedia projects or just those who
have registered and/or edited on ENWP?
Thanks,
Pine
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes used in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
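One way such a mapping is often sketched: a standard ISO 639-1 → 639-3 table plus a hand-curated override list for Wikipedia codes that deviate from ISO. The tables below are tiny illustrative fragments, not a complete or authoritative mapping:

```python
# Fragment of the standard ISO 639-1 -> ISO 639-3 correspondence.
ISO1_TO_ISO3 = {"en": "eng", "de": "deu", "fr": "fra", "sq": "sqi"}

# Wikipedia codes that do not follow ISO: als.wikipedia.org is Alemannic
# (ISO 639-3 gsw), whereas "als" in ISO 639-3 is Tosk Albanian; the
# Simple English Wikipedia maps to English.
WIKI_OVERRIDES = {"als": "gsw", "simple": "eng"}

def wiki_to_iso3(code):
    """Map a Wikipedia language code to ISO 639-3, overrides first."""
    if code in WIKI_OVERRIDES:
        return WIKI_OVERRIDES[code]
    return ISO1_TO_ISO3.get(code)

print(wiki_to_iso3("als"), wiki_to_iso3("en"))
```

The override table is the part that needs curation; a plain 639-1 lookup silently mislabels the exceptional wikis.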
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com
You may have heard about the in-progress work on the Code of Conduct for
Wikimedia technical spaces
(https://www.mediawiki.org/wiki/Code_of_conduct_for_technical_spaces/Draft).
It is currently in draft form, and we are in the process of finalizing
the intro, "Principles", "Expected behavior" and "Unacceptable behavior"
sections.
An earlier version of these sections (except for "Expected behavior")
reached consensus.
However, there is now a new draft, and you can weigh in on whether to
use it instead:
https://www.mediawiki.org/wiki/Talk:Code_of_conduct_for_technical_spaces/Dr…
.
I will continue to ask for your feedback as we discuss the remaining
sections later.
Thanks,
Matt Flaschen
Last week we started up a new AB test[1] comparing the existing completion
suggestions against a new completion suggestion API. This very simply puts
1 in 10000 users into the test bucket, and a further 1 in 10000 users into
the control bucket like so:
    function oneIn( populationSize ) {
        return Math.floor( Math.random() * populationSize ) === 0;
    }

    if ( oneIn( 10000 ) ) {
        // test bucket
    } else if ( oneIn( 10000 ) ) {
        // control bucket
    } else {
        return; // rejected
    }
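One small aside: because the control check only runs when the test draw fails, the effective control probability is marginally below 1 in 10,000. A quick check (Python here, purely for illustration):

```python
# Effective bucket probabilities implied by the sequential checks above:
# the control branch is only evaluated when the test draw fails.
N = 10000
p_test = 1 / N
p_control = (1 - 1 / N) * (1 / N)
p_rejected = 1 - p_test - p_control

print(p_test, p_control, p_rejected)
```

The difference is negligible for analysis purposes, but worth stating so the sampling rates are not assumed to be exactly equal.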
On every page load we generate a random 64-bit number via
`mw.user.generateRandomSessionId()`. This is used to correlate events
performed by the same user on the same page, and is logged with all our
events as event_pageId. In an older test (turned off September 3rd) using
this same event_pageId scheme, roughly 0.3% of event_pageId values came
from multiple IP addresses, which seems sane and normal:
mysql:research@analytics-store.eqiad.wmnet [log]> select count, count(count)
    from (select count(distinct clientIp) as count
          from TestSearchSatisfaction_12423691
          group by event_pageId) x
    group by count;
+-------+--------------+
| count | count(count) |
+-------+--------------+
|     1 |       411104 |
|     2 |         1500 |
+-------+--------------+
2 rows in set (3.11 sec)
On the test we just started, though, we are seeing 48% of event_pageId
values being reported by multiple IP addresses. We can't find any way to
explain why this has changed so much, and as such are uncertain whether we
can rely on the other data collected by this test.
mysql:research@analytics-store.eqiad.wmnet [log]> select count, count(count)
    from (select count(distinct clientIp) as count
          from CompletionSuggestions_13424343
          group by event_pageId) x
    group by count;
+-------+--------------+
| count | count(count) |
+-------+--------------+
|     1 |         1176 |
|     2 |          243 |
|     3 |          254 |
|     4 |          212 |
|     5 |          143 |
|     6 |          102 |
|     7 |           64 |
|     8 |           36 |
|     9 |           16 |
|    10 |           14 |
|    11 |            8 |
|    12 |            5 |
+-------+--------------+
12 rows in set (0.03 sec)
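For what it's worth, genuine collisions of a random 64-bit id can't plausibly explain this. A rough birthday-problem estimate (assuming the id really is 64 uniform random bits, and using the standard approximation):

```python
import math

def collision_prob(n, bits=64):
    """Approximate probability of at least one collision among n
    uniformly random `bits`-bit identifiers (birthday approximation):
    P ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return 1 - math.exp(-n * n / (2 * 2 ** bits))

# Even a million ids gives a vanishing collision probability, so real
# id collisions cannot account for 48% of values repeating.
print(collision_prob(1_000_000))
```

That points at something systematic (shared state, caching, or a code change) rather than random chance.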
We have a third schema in production that has been collecting events the
entire time. It seems to have started showing this issue on September 10th,
which lines up with a Thursday train deployment:
mysql:research@analytics-store.eqiad.wmnet [log]> select date, MAX(count)
    from (select substr(timestamp, 1, 8) as date,
                 count(distinct clientIp) as count
          from TestSearchSatisfaction2_13223897
          group by substr(timestamp, 1, 8), event_pageId) x
    group by date;
+----------+------------+
| date     | MAX(count) |
+----------+------------+
| 20150902 |          1 |
| 20150903 |          2 |
| 20150904 |          2 |
| 20150905 |          4 |
| 20150906 |          3 |
| 20150907 |          3 |
| 20150908 |          3 |
| 20150909 |          3 |
| 20150910 |         11 |
| 20150911 |         12 |
| 20150912 |         14 |
| 20150913 |         18 |
| 20150914 |         13 |
+----------+------------+
13 rows in set (1.74 sec)
Does anyone have any ideas for where this change could have come from?
[1]
https://gerrit.wikimedia.org/r/#/c/236937/1/modules/ext.wikimediaEvents.sea…
Hi,
I need to reboot stat1001, stat1002, stat1003 to update the running Linux
kernels on these hosts.
I'm planning to start the reboots tomorrow, 30th September, at 13:00
UTC (6am Pacific time).
If that is a bad time (e.g. because you have long-running or crucial
scripts running on one of them), please get in touch with me and we can
move it to another time.
Cheers,
Moritz
From the intertubes:
@tlipcon: Super excited to finally talk about what I've been working on the
last 3 years: Kudu! http://t.co/1W4sqFBcyH http://t.co/1mZCwgdOO5
Might be useful for the MediaWiki tables.
-Toby
Hi WikiMedia Analytics,
I'm a student who has been doing work with the pagecount files from
Wikimedia.
During the last few days, it looks like the latest pagecount files are
being published more slowly than before.
Usually, when I go to the following link:
http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-09/
the file for a given hour would appear within an hour or so afterwards.
Is this still going to be the case? It does not seem to hold for
9/16 and 9/17.
Also, pagecounts-20150916-090000.gz
<http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-09/pagecoun…>
does not seem to be the correct size.
Thanks,
Tony Ho
Hi Analytics,
your input on the analytics/wikistats task at
https://phabricator.wikimedia.org/T113695
is welcome, to help find the best way to move forward in the next few weeks.
Who could try to tackle this?
Thanks in advance for your help!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/