I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (around 2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released. This data set has been widely used
for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java.
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before being made public.
I would be glad to help with that.
The previously released data set contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
logs).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
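To make the proposed record layout concrete, here is a minimal parsing sketch in Python. The field order, the separator, and the sample line are illustrative assumptions on my part, not the actual format of any released data set:

```python
# Hedged sketch: parse one line of a hypothetical anonymized trace in
# the proposed format. Field names/order are illustrative assumptions.
from typing import NamedTuple

class TraceRecord(NamedTuple):
    counter: int        # 1) running record counter
    timestamp: float    # 2) request timestamp (Unix seconds)
    url: str            # 3) requested URL (no client information)
    is_update: bool     # 4) update flag
    cache_host: str     # 5) hostname of the serving cache
    cache_status: str   # 6) e.g. "hit", "miss", "pass"
    response_size: int  # 7) response size in bytes

def parse_line(line: str) -> TraceRecord:
    c, ts, url, upd, host, status, size = line.split()
    return TraceRecord(int(c), float(ts), url, upd == "1",
                       host, status, int(size))

record = parse_line("1 1456786800.0 /wiki/Main_Page 0 cp1065 hit 63412")
```

A cache simulator could then replay such records in timestamp order to measure hit ratios under different eviction policies.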
Let me know your thoughts.
I hacked up a very quick count of the 2015 video viewing aggregate
figures, using the data that Bartosz put together last year - with the
caveat that the data only goes up to 10 December, but it's probably
indicative of whole-year trends. I haven't yet tried to merge in the
11-31/12 data. Nothing very insightful, but I don't recall seeing it
done before, so it might be of interest!
The headline figure is that we had about three billion (!!)
video/audio plays during the year, and that some of the most popular
items are insanely popular - the most popular was viewed an average of
42,000 times a day, every day.
Pine: the video you asked about in the other thread was viewed 187,899
times from 31/10/15 to 10/12/15. So there's half your answer :-)
- Andrew Gray
We’ve gotten good participation as we’ve worked on sections of the Code
of Conduct over the past few months, and have made considerable
improvements to the draft based on your feedback.
Given that, and the community approval through the discussions on each
section, the best approach is to proceed by approving section-by-section
until the last section is done.
So, please continue to improve the Code of Conduct by participating now
and as future sections are discussed. When the last section is
completed and approved on the talk page, the Code of Conduct will become
policy and no longer be marked as a draft.
Also, two more discussions regarding the Code of Conduct have been
resolved and incorporated into the draft.
* "Enforcement issues" addressed the reporting process and clarified
that Committee decisions could not be circumvented
* "Marginalized and underrepresented groups" forbids discrimination
There seem to be issues with the Edit_13457736 table in the log database
on the db1047 host.
db1047 may be better known to some of you as
"s1-analytics-slave" or "analytics-slave". You can use dbstore1002 for
the time being ("analytics-store" or "sX-analytics-slave", where X is
2-7) to read from that table. Other "schemas" are not affected.
I am fixing those right now, but it will take some time, as I may have
to handle (again) the recently purged rows on that table. Worst case
scenario, I may have to do a quick reboot of that machine.
I will update the status soon to communicate next steps.
I usually send these to multiple lists, but I realized I forgot to send
this to the ones besides wikitech-l.
The "Marginalized and underrepresented groups" discussion
is still open. I'll probably give it two weeks total, which means
closing it late tomorrow.
-------- Forwarded Message --------
Subject: Please provide feedback on new discrimination and enforcement
sections of Code of Conduct
Date: Wed, 16 Mar 2016 20:23:24 -0400
From: Matthew Flaschen <mflaschen(a)wikimedia.org>
To: Wikitech List <wikitech-l(a)lists.wikimedia.org>
Thanks for your participation in the recent Code of Conduct discussions.
The "Marginalized and underrepresented groups" discussion had a lot of
feedback. There was not consensus to use the exact original wording,
but many people expressed willingness to support a modified text.
I've proposed such a new text, based on Neil P. Quinn's text, with a
small modification to account for discrimination required by law (e.g.
age of people who can sign certain contracts).
Please participate at
The "Enforcement issues" section received general support, but some of
that was conditional, or expressed preference for wording that developed
during the discussion. The original wording also did not address the
appeals body, which was raised in the discussion.
Please participate at
Update regarding completed discussions:
The "Clarification of legitimate reasons for publication of private
communications and identity protection" and "Definitions - trolling,
bad-faith reports" discussions have been closed.
They both had support, and I've incorporated the text into the draft.
Just to let you know that after the modifications to the Edit table in the EL
database, the reports have been able to catch up and back-fill up to today.
So https://edit-analysis.wmflabs.org/compare/ is working again.
*Marcel Ruiz Forns*
We would like to have the request URIs for some period of usage -
let's say 1 month.
According to the data format,
the webrequest attributes we need are the following:
Do we need to go through the NDA process, or is it possible to get the data
right away from a public dataset?
> Can you be more specific about what you need, Michal? If you truly
> need access to the private data that we keep in wmf.webrequest for a
> limited time, then you'd have to go through a process to sign an NDA.
> But if you tell us what you need, there may be a public dataset that
> you can use.
> On Thu, Mar 3, 2016 at 2:48 PM, Michal Bystricky
> <michal.bystricky(a)stuba.sk> wrote:
> Hello Analytics Team,
> We would like to have one-time access to wmf.webrequest data. What
> is the correct way of accessing the data?
> In our research group, we want to simulate the requests for
> specific version of WikiMedia.
> Michal Bystricky
We're happy to announce a few improvements to Analytics data releases on
* We are releasing a new dataset, an estimate of Unique Devices accessing
our projects 
* We are officially making available a better Pageviews dataset 
* We are deprecating two older pageview statistics datasets
* We moved Analytics data from /other to /analytics 
*Unique Devices:* Since 2009, the Wikimedia Foundation has used comScore to
report data about unique web visitors. In January 2016, however, we
decided to stop reporting comScore numbers because of certain
limitations in the methodology; these limitations translated into
misreported mobile usage. We are now ready to replace the comScore numbers with
the Unique Devices dataset. While unique devices do not equal
unique visitors, they are a good proxy for that metric, meaning that a major
increase in the number of unique devices is likely to come from an increase
in distinct users. We understand that counting uniques raises significant
privacy concerns, so we count unique devices in a privacy-conscious way:
the method does not rely on any cookie by which your browsing history
could be tracked.
We invite you to explore this new dataset and hope it’s helpful for the
Wikimedia community in better understanding our projects. This data can
help measure the reach of Wikimedia projects on the web.
*Pageviews:* This is the best-quality data available for counting the
number of pageviews our projects receive at the article and project level.
We've upgraded from pagecounts-raw to pagecounts-all-sites, and now to
pageviews, in order to filter out more spider traffic and measure something
closer to what we think is a real user viewing content. A short history
might be useful:
* pagecounts-raw: was maintained by Domas Mituzas originally and taken
over by the analytics team. It was and still is the most used dataset,
though it has some major problems. It does not count access to the mobile
site, it does not filter out spider or bot traffic, and it suffers from
unknown loss due to logging infrastructure limitations.
* pagecounts-all-sites: uses the same pageview definition as
pagecounts-raw, and so also does not filter out spider or bot traffic. But
it does include access to mobile and zero sites, and is built on a more
reliable logging infrastructure.
* pagecounts-ez: is derived from the best data available at the time.
So until December 2015, it was based on pagecounts-raw and
pagecounts-all-sites, and now it's based on pageviews. This dataset is
great because it compresses very large files without losing any
information, still providing hourly page and project level statistics.
So the new dataset, pageviews, is what's behind our pageview API and is now
available in static files for bulk download back to May 2015. But the
multiple ways to download pageview data are confusing for consumers, so
we're keeping only pageviews and pagecounts-ez and deprecating the other
two. If you'd like to read more about the current pageview definition,
details are on the research page.
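As a small illustration of working with the bulk files, the sketch below aggregates per-project view counts from dump lines. It assumes the four-column whitespace-separated layout (project, page title, view count, byte count) of the hourly pagecounts files; check the pageview definition on the research page before relying on it for the newer files:

```python
# Hedged sketch: sum per-project view counts from pageview dump lines.
# The four-column layout is an assumption based on the older
# pagecounts files; the sample lines below are made up.
from collections import Counter
from typing import Iterable

def project_totals(lines: Iterable[str]) -> Counter:
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, _title, views, _bytes = parts
        totals[project] += int(views)
    return totals

sample = [
    "en.wikipedia Main_Page 242 0",
    "en.wikipedia Cache_(computing) 17 0",
    "de.wikipedia Wikipedia 93 0",
]
totals = project_totals(sample)
```

The same loop works on a decompressed hourly file streamed line by line, so you never need the whole file in memory.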
*Deprecating:* We are deprecating the pagecounts-raw and
pagecounts-all-sites datasets in May 2016 (discussion here:
https://phabricator.wikimedia.org/T130656 ). This data suffers from many
artifacts, lack of mobile data, and/or infrastructure problems, and so is
not comparable to the new way we track pageviews. It will remain here
because we have historical data that may be useful, but it will not be
maintained or updated beyond May 2016.
*Clean-up:* Analytics data on dumps was crammed into /other with unrelated
datasets. We made a new page to receive current and future datasets 
and linked to it from /other and /. Please let us know if anything there
looks confusing or opaque, and we'll be happy to clarify.