We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
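For the last use case, the (referer, article, count) rows map directly onto a Markov chain's transition probabilities. A minimal sketch (the rows below are invented examples, not actual January 2015 counts):

```python
from collections import defaultdict

# Invented example rows: (referer, article, count)
rows = [
    ("Main_Page", "London", 120),
    ("Main_Page", "Paris", 80),
    ("London", "Paris", 50),
]

# Total clicks out of each referer
totals = defaultdict(int)
for referer, article, count in rows:
    totals[referer] += count

# P(article | referer) = count / total clicks out of that referer
transitions = {
    (referer, article): count / totals[referer]
    for referer, article, count in rows
}

print(transitions[("Main_Page", "London")])  # 0.6
```

Each transition probability is just a pair's count normalized by the referer's outgoing total, so the probabilities out of any referer sum to 1.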
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream <https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream>
Ellery and Dario
Roan:
The data for the Echo schema (https://meta.wikimedia.org/wiki/Schema:Echo) is
quite large, and we are not sure it is even used.
Can you confirm either way? If it is no longer used, we will stop collecting
it.
Thanks,
Nuria
I should have started this discussion a while ago, but it's easier to catch
up on work during vacation :)
We currently have 3 available static file dumps of pageview data. I will
explain them here and explain my thoughts on simplifying the situation.
Feel free to turn this thread into a wiki.
* PAGECOUNTS-RAW <http://dumps.wikimedia.org/other/pagecounts-raw/>. We
have this data going back to 2007. It uses a very simple pageview
definition that incorrectly counts, for example, banner views as
pageviews.
* PAGECOUNTS-ALL-SITES
<http://dumps.wikimedia.org/other/pagecounts-all-sites/>. We have this
data starting in late 2014. Compared to PAGECOUNTS-RAW, this dataset also
adds traffic from the mobile versions of our sites. But it's still using
the same simple pageview definition.
* PAGEVIEWS <http://dumps.wikimedia.org/other/pageviews/>. We have this
data starting in May 2015. It implements the new and much improved
pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>
that we now use. This is the same pageview definition used in the pageview
API. This dataset also removes spider traffic and any automata traffic
that we can detect.
All three datasets are in the same format (Domas's archive format).
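The hourly files in that format are space-separated per-page view counts; a minimal parser, assuming the commonly documented field layout of project code, page title, view count, and bytes transferred (the example line is made up):

```python
# Parse one line of an hourly pagecounts file. Lines look like
# "en Main_Page 42 1234567": project code, page title, view count,
# bytes transferred. The field layout is as commonly documented for
# these dumps; the example line is invented.
def parse_pagecounts_line(line: str):
    project, title, views, num_bytes = line.strip().split(" ")
    return project, title, int(views), int(num_bytes)

print(parse_pagecounts_line("en Main_Page 42 1234567"))
# ('en', 'Main_Page', 42, 1234567)
```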
So, before we can simplify this confusing situation, we need your help and
input about what to keep and how to keep it. Here's the approach I would
take:
Combine pagecounts-raw and pagecounts-all-sites into a new dataset called
"pagecounts". Keep producing data for this dataset indefinitely, but retire
"pagecounts-raw" and "pagecounts-all-sites". This way, we can compare new
data with historical data going back as far as we need. We would explain
on dumps.wikimedia.org/other that this dataset gains mobile data starting
in October 2014, which accounts for the local spike that appears there.
This dataset would remain a pretty bad estimate of actual page views, and
would remain sensitive to automata and spider spikes. But in combination
with the "pageviews" dataset, I think it would be useful.
What do you all think? Sound off in this thread, and if we have consensus
I'll start the cleanup.
Hi,
I am interested to know if Wikipedia makes public how many backlinks each page gets.
I am working on a search engine for Wikipedia, and, as you would expect, it sucks.
So I went and tested the same searches directly on Wikipedia, and no offence, they suck even more.
So I went to Google and performed the same searches with site:wikipedia.org added, and Google was a little bit better (although not much, compared with my 1-day-development search engine).
I want to make my Wikipedia search better, and having a table that tells me how many non-Wikipedia pages point to a given Wikipedia page might improve my algorithm.
Does anyone know if Wikipedia publishes such data?
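For internal links, the kind of backlink table I have in mind could be derived from (source, target) link pairs, e.g. parsed out of the pagelinks SQL dump; a rough sketch (the link pairs below are invented examples):

```python
from collections import Counter

# Invented example (source, target) link pairs; in practice these could
# be parsed out of e.g. the enwiki pagelinks SQL dump (internal links only).
links = [
    ("Paris", "France"),
    ("Lyon", "France"),
    ("France", "Paris"),
]

# backlinks[page] = number of pages linking to it
backlinks = Counter(target for _, target in links)

print(backlinks["France"])  # 2
```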
Thank you!
Edison Nica
http://www.0pii.com
Edisonn(a)0pii.com
Sent from my T-Mobile 4G LTE Device
Team:
The MobileWikiAppShareAFact schema is sending a lot of events; it may be
worth thinking about whether we need that many. It is again a case where
tables are becoming huge and slow to query.
cc-ing Jon as schema owner.
Could this data be sampled more aggressively? I have filed a ticket about
this:
https://phabricator.wikimedia.org/T122224
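(For illustration, one way to sample deterministically per session; this sketch is an assumption about the mechanism, not EventLogging's production implementation:)

```python
import hashlib

def sampled_in(session_id: str, rate: float = 0.1) -> bool:
    """Keep roughly `rate` of sessions, decided deterministically per session."""
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

# A given session is either always or never sampled, so event volume drops
# without fragmenting any one session's data.
```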
Thanks,
Nuria
On Tue, Dec 22, 2015 at 8:35 AM, Adam Baso <abaso(a)wikimedia.org> wrote:
> Replacing mobile-tech with mobile-l (internal mobile-tech list
> discontinued).
>
>
> On Tuesday, December 22, 2015, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>> Team:
>>
>> As part of our effort to convert the eventlogging MySQL database to the
>> TokuDB engine, we need to stop eventlogging events from flowing into the
>> MobileWikiAppShareAFact table. We are using this one table to see how long
>> the conversion will take, in order to plan for a larger outage window.
>>
>>
>> Let us know whether the data should be backfilled, as it can be; we
>> anticipate events will not flow into the table for the better part of a day.
>>
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
> _______________________________________________
> Mobile-l mailing list
> Mobile-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>
>
Hi Analytics,
The report card site's most recent data for the unique visitors stats is
from May 2015. Will this be updated in the future?
Also, the information shown on the "New Editors Per Month for All Wikimedia
Projects" chart goes back only to late 2012. Is there a way to get the data
for that chart all the way back to 2001? I can pull the tables for all
Wikipedias back to 2001 from the report cards site, but I can't pull the
tables for all Wikimedia projects back to 2001 AFAIK.
Thanks!
Pine
Hi,
I see that the (amazing!) API still can't give us results for the whole of
2015. Is there any way we can get these page views per project? And also,
the most edited articles in 2015 per project?
This could be great PR information for the communication representatives
around the world to release to local journalists.
Regards,
Itzik Edri
Chairperson, Wikimedia Israel
+972-(0)-54-5878078 | http://www.wikimedia.org.il
Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment!
Hey folks.
I'm looking at:
https://dumps.wikimedia.org/other/pagecounts-all-sites/2015/
Can anyone tell me where I'd find these files via stat1003? I'm pretty
sure I'm getting the pagecount dumps in /mnt/data/pagecounts/, but maybe
I'm mistaken.
-Aaron
Hi all,
Soon, we will be merging the mobile web cache requests with the text cache requests. The text caches will then serve requests for the mobile web[1].
This means that the webrequest_source=‘mobile’ partition in the webrequest table in Hive will soon be empty, and all data that was previously in it will be found in the webrequest_source=‘text’ partition.
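Consumers that currently filter on the 'mobile' partition would then need to read the 'text' partition and distinguish mobile traffic another way, e.g. by hostname. A rough sketch (the hostname heuristic below is an illustrative assumption, not the production pageview definition):

```python
def is_mobile_host(uri_host: str) -> bool:
    # Mobile web domains conventionally contain ".m."
    # (e.g. en.m.wikipedia.org vs en.wikipedia.org).
    # Illustrative heuristic only, not the production definition.
    return ".m." in uri_host

print(is_mobile_host("en.m.wikipedia.org"))  # True
print(is_mobile_host("en.wikipedia.org"))    # False
```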
There are only 3 datasets that currently only use the webrequest_source=‘mobile’ partition:
- /a/log/webrequest/archive/mobile
- /a/log/webrequest/archive/5xx-mobile
- /a/log/webrequest/archive/zero
(These are paths on stat1002, but they also exist in HDFS.)
These datasets originally came from udp2log, but since early last year they have been generated from Hadoop. With the upcoming cache merge, these jobs will have to parse through all text requests, which will make Hadoop busier.
Do we know if these are being used? Would anyone be upset if we no longer generated these datasets?
Thanks!
-Andrew
[1] https://phabricator.wikimedia.org/T109286