Re: [Wiki-research-l] [Wikimedia-l] wikipedia access traces ?

Pine W

10 Sep 2014

2:54 p.m.

Hi Valerio, This kind of request is a better fit for the Research mailing list. I've included the email for that list in the To: line of this email reply. Pine On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

Dear WikiMedia foundation, in the context of an EU research project [1], we are interested in accessing Wikipedia access traces. In the past, such traces were given for research purposes to other groups [2]. Unfortunately, only a small percentage (10%) of that trace has been made available. We are interested in accessing the totality of that same trace (or even better, a more recent one, but the same one will do). If this is not the correct ML to use for such requests, could anyone please redirect me to the correct one? Thanks again for your attention, Valerio Schiavoni Post-Doc Researcher University of Neuchatel, Switzerland 1 - http://www.leads-project.eu 2 - http://www.wikibench.eu/?page_id=60 _______________________________________________ Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines Wikimedia-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>

Valerio Schiavoni

17 Sep

4:53 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

...

On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

Aaron Halfaker

5:03 p.m.

Just to confirm, https://dumps.wikimedia.org/other/pagecounts-raw/ won't work for you? On Wed, Sep 17, 2014 at 8:53 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Giovanni Luca Ciampaglia

5:09 p.m.

...

On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

Valerio Schiavoni

5:14 p.m.

Hello Aaron, thanks for your reply. On Wed, Sep 17, 2014 at 4:03 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

...

Just to confirm, https://dumps.wikimedia.org/other/pagecounts-raw/ won't work for you?

Unfortunately, no. Those logs only provide page counts, without the associated timestamps ("when" those pages have been accessed). If such logs exist, they would do perfectly.

...

2 - http://www.wikibench.eu/?page_id=60

By comparison, the logs in that dataset look like this:

3325795636 1191194118.711 http://en.wikipedia.org/w/index.php?title=MediaWiki:Monobook.css&usemsg… -
3325795635 1191194118.803 http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Icono_aviso_borrar… -
3325795639 1191194118.671 http://de.wikipedia.org/w/index.php?title=MediaWiki:Monobook.css&usemsg… -

The first token is just a counter, the second one is a Unix timestamp, then there is the Wikipedia URL in the request, and finally a flag indicating whether the request issued a database update or not (none of those three did). best, Valerio
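For readers who want to work with lines in this four-field layout, here is a minimal parsing sketch. It assumes the fields are whitespace-separated and that "-" in the last field means "no database update", as in the three sample lines; what a non-"-" flag looks like is not shown in this thread, so treating any other value as "update" is an assumption. The names `TraceEntry` and `parse_trace_line` and the example URL are hypothetical.

```python
from typing import NamedTuple


class TraceEntry(NamedTuple):
    counter: int        # running request counter (first token)
    timestamp: float    # Unix timestamp with sub-second precision
    url: str            # URL of the HTTP request
    db_update: bool     # assumption: any flag other than "-" means an update


def parse_trace_line(line: str) -> TraceEntry:
    """Split one trace line into its four whitespace-separated fields."""
    counter, timestamp, url, flag = line.split()
    return TraceEntry(int(counter), float(timestamp), url, flag != "-")


# Hypothetical line in the same layout as the samples quoted above.
entry = parse_trace_line(
    "3325795636 1191194118.711 http://en.wikipedia.org/wiki/Example -"
)
print(entry.timestamp, entry.url, entry.db_update)
```

A line with more or fewer than four fields raises `ValueError`, which seems preferable to silently misreading a trace.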

Valerio Schiavoni

5:24 p.m.

...

On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

Valerio Schiavoni

5:44 p.m.

Hello Giovanni, on second thought, I think the Click dataset won't do either. I've parsed the smaller sample [1], which is said to be extracted from the bigger one. In that dataset there are ~34k entries related to Wikipedia, but they look like the following: {"count": 1, "timestamp": 1257181201, "from": "en.wikipedia.org", "to": "ko.wikipedia.org"} That is, the log only reports the host/domain accessed, but not the specific URL being requested (to be clear, the one in the HTTP request issued by the client). The URL is what is of main interest to me. Thanks for your interest anyway! Valerio 1 - http://carl.cs.indiana.edu/data/#traffic-websci14 On Wed, Sep 17, 2014 at 4:24 PM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

Aaron Halfaker

6:23 p.m.

Hi Valerio, The page counts dataset has a time resolution of one hour. Is that too coarse? How fine a resolution do you need? On Wed, Sep 17, 2014 at 9:44 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

Hello, just bumping my email from last week, since so far I did not get any answer. Should I consider that dataset to be somehow lost? I've also contacted the researchers who partially released it, but making it publicly available is tricky for them due to its size (12 TB), a volume that might instead be within the norm for the operations handled daily by Wikipedia servers. Thanks again, Valerio

Valerio Schiavoni

6:39 p.m.

Hello Aaron, 1 hour is way too coarse. Let's say 1 second would be ok. Is that available ? On Wed, Sep 17, 2014 at 5:23 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

...

Valerio, I didn't know such data existed. As an alternative, perhaps you could have a look at our click datasets, which contain requests to the Web at large (i.e., not just Wikipedia) generated from within the campus of Indiana University over a period of several months. HTH http://carl.cs.indiana.edu/data/#click Cheers G Giovanni Luca Ciampaglia ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciampag(a)indiana.edu


Aaron Halfaker

6:56 p.m.

I don't think that we keep those logs historically. analytics-l (CC'd) might have more insights. Do we have anything more granular than the hourly view logs available here: https://dumps.wikimedia.org/other/pagecounts-raw/ ? On Wed, Sep 17, 2014 at 10:39 AM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

Hello Giovanni, thanks for the pointer to the Click datasets. I'd have to take a look at the complete dataset to see how many of those requests touch Wikipedia. Then, one of the requirements to access that data is: "The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment." I have to check whether this is possible and how long it might take to ship a hard drive from Switzerland and get it back. I'll let you know! Best, Valerio


Benj. Mako Hill

18 Sep

10:03 p.m.

...

Unfortunately, no. Those logs only provide page counts, without the associated timestamps ("when" those pages have been accessed). If such logs exist, they would do perfectly.

The pagecount data /has/ timing data but they are "binned" by the hour. I don't think more comprehensive data (all pages, all languages, nearly all viewers) over a long period of time exists anywhere and I don't think any similarly comprehensive data exists before 2007 at all. You might find more granular data for short periods of time (like the WikiBench data or maybe stuff that's been collected more recently by WMF but isn't published) or much more detailed data from longer periods of time for a subset of users on a particular network (perhaps like the Indiana data, or toolbar data like the Yahoo data that some WP researchers have used). I would /love/ to hear that I am wrong about this and that there's some wonderful, granular, broad, long-term dataset of pageviews that I just don't know about. :) Later, Mako -- Benjamin Mako Hill http://mako.cc/ Creativity can be a social contribution, but only in so far as society is free to use the results. --GNU Manifesto
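To make "binned by the hour" concrete, here is a small sketch of how per-request timestamps collapse into hourly counts. It is only an illustration of the idea, not the actual pipeline behind the pagecounts files; the function name `bin_by_hour` and the (timestamp, page) input layout are assumptions.

```python
from collections import Counter


def bin_by_hour(requests):
    """Collapse (unix_timestamp, page) pairs into per-hour view counts.

    Returns a Counter keyed by (hour_start, page), where hour_start is the
    Unix time of the beginning of the hour containing the request.
    """
    counts = Counter()
    for ts, page in requests:
        hour_start = int(ts) - int(ts) % 3600
        counts[(hour_start, page)] += 1
    return counts


# Three hypothetical requests: two in the first hour, one in the second.
hourly = bin_by_hour([(0.5, "Main_Page"), (3599.9, "Main_Page"), (3600.0, "Main_Page")])
print(hourly)
```

After binning, the order and spacing of requests within each hour cannot be recovered, which is why hourly counts do not substitute for a per-request trace.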

Pine W

10:07 p.m.

I suppose you could get more granular data by conducting an opt-in study of some kind, and you would need to be careful that users who haven't opted in are not accidentally included or indirectly have their privacy affected. I agree that collection at intervals shorter than an hour is going to raise a lot of privacy considerations for users who have not opted in. Pine On Thu, Sep 18, 2014 at 12:03 PM, Benj. Mako Hill <mako(a)atdot.cc> wrote:

...

Unfortunately, no. Those logs only provide page counts but without the associated timestamps ("when" those pages have been accessed). If such logs exist, they would perfectly do..

Benj. Mako Hill

11:31 p.m.

...

That would certainly work for some research questions and that's more or less what most toolbar data is. The problem is that often questions answered with view data are about the overall popularity or visibility of pages, which requires data that is representative. There are lots of reasons to believe that people who opt in aren't going to be representative of all Wikipedia readers. Regards, Mako -- Benjamin Mako Hill http://mako.cc/ Creativity can be a social contribution, but only in so far as society is free to use the results. --GNU Manifesto

Pine W

11:49 p.m.

Yes, but supposedly phone survey companies are able to get representative samples of broad populations despite many people refusing to respond to phone surveys. If opt-in users were chosen using similar methods, could arguably representative data be obtained? Pine On Sep 18, 2014 1:32 PM, "Benj. Mako Hill" <mako(a)atdot.cc> wrote:

...

I suppose you could get more granular data by conducting an opt-in study of some kind, and you would need to be careful that users who haven't opted in are not accidentally included or indirectly have their privacy affected. I agree that collection at intervals shorter than an hour is going to raise a lot of privacy considerations for users who have not opted in.

Benj. Mako Hill

19 Sep

12:21 a.m.

...

The way that people build representative surveys from non-representative data is by understanding quite a lot about the nature and structure of the bias in their sample. You might want to think about how people do this as trying to create a very complicated system of weights. Folks who do this for US phone surveys, for example, have spent many decades and many millions of dollars on research to understand how to get reliable results and even then it's a quickly moving target. They still sometimes miss things and get things wrong. That said, there are certainly things we can learn. Aaron Shaw and I actually did something related with one of the big Wikipedia surveys in this article: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0065782 In our case, our study was only possible because (a) we had very good luck finding "ground truth" data from the right point in time, (b) we had detailed demographic data on folks from the WP survey, and (c) we made a series of untestable assumptions. After all that work, we still can't know that we've got it right. We really can only suggest that there are reasons to believe our estimates are better than pretending that the opt-in survey is unbiased. In the case of signing up for a Wikipedia toolbar, we might not even attract a sub-population that /can/ reliably be used to build representative estimates. :-( Regards, Mako -- Benjamin Mako Hill http://mako.cc/ Creativity can be a social contribution, but only in so far as society is free to use the results. --GNU Manifesto

Jonathan Morgan

12:43 a.m.

See what you started, Pine? *This* is what happens when you get professors talking about research methods. :P - J On Thu, Sep 18, 2014 at 2:21 PM, Benj. Mako Hill <mako(a)atdot.cc> wrote:

...

-- Jonathan T. Morgan Learning Strategist Wikimedia Foundation User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)> jmorgan(a)wikimedia.org

Benj. Mako Hill

12:59 a.m.

...

See what you started, Pine? *This* is what happens when you get professors talking about research methods.

What, you get nearly identical messages written simultaneously by serial co-authors? ;) Later, Mako -- Benjamin Mako Hill http://mako.cc/ Creativity can be a social contribution, but only in so far as society is free to use the results. --GNU Manifesto

Richard Jensen

2:55 a.m.

The basic issue in sampling is to decide what the target population T actually is. Then you weight the sample so that each person in the target population has an equal chance w and people not in it have weight zero.

So what is the target population we want to study?
--the world's population?
--the world's educated population?
--everyone with internet access?
--everyone who ever uses Wikipedia?
--everyone who uses it a lot?
--everyone who has knowledge to contribute in positive fashion?
--everyone who has the internet, skills and potential to contribute?
--everyone who has the potential to contribute but does not do so?

Richard Jensen rjensen(a)uic.edu
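As a rough illustration of the weighting idea discussed in this thread, the sketch below uses simple post-stratification: each respondent is weighted so that the groups in the sample match their known shares of the target population, and respondents outside the target population get weight zero. This is a deliberately simplified stand-in, not the method used in any study mentioned here; the function name, the group labels and the population shares are all hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per respondent.

    sample_groups: group label of each respondent (hypothetical labels).
    population_shares: assumed share of each group in the target
        population; groups missing from this mapping are treated as
        outside the target population and get weight 0.
    Weights of in-target respondents sum to the number of in-target
    respondents.
    """
    in_target = [g for g in sample_groups if population_shares.get(g, 0.0) > 0]
    n = len(in_target)
    counts = Counter(in_target)
    weights = []
    for g in sample_groups:
        share = population_shares.get(g, 0.0)
        weights.append(share * n / counts[g] if share > 0 else 0.0)
    return weights


# Hypothetical sample: group "a" is over-represented, "x" is out of target.
print(poststratification_weights(["a", "a", "a", "b", "x"], {"a": 0.5, "b": 0.5}))
```

The sketch only corrects for group composition that is already known; it does nothing about bias within a group, which is the harder problem raised earlier in the thread.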

Pine W

9:36 a.m.

Let's loop back to the request at hand. Valerio, can you describe your use case for access traces at intervals shorter than one hour? The very likely outcome of this discussion is that the access traces at shorter intervals will not be made available, but I'm curious about what you would do with the data if you had it. Pine On Sep 18, 2014 4:55 PM, "Richard Jensen" <Rjensen(a)uic.edu> wrote:

...

Joe Corneli

2:20 p.m.

On Thu, Sep 18 2014, Pine W wrote:

...

... Well, at least it could be representative of the opt in population, and if that's an interesting enough population it could be worthwhile. For example people who opt to donate during the yearly fund-drive could be further invited to participate in page view tracking, say, and people who've opted in to both conditions might be taken to be representative of donors, who might be taken to be (vaguely) representative of the general population. The data from this group could be factored out against other people who opt into page view tracking who aren't donors, etc etc. (Probably I've described something that's already been done, or that can't be done; I'm not attached to the particular example!) Further OT micro-rant about population research in free/open culture -- Although I'm very naive about Wikipedia research I've been wondering if it would be possible to do a crowd-sourced pattern finding research on Emacs use, combining ideas from: http://www.emacswiki.org/emacs/RepetitionDetection http://popcon.debian.org/ At least in the programming world, I think the "moral" thing to do is to write programs that optimize repeated activities, and that there would be a potentially huge gain to doing this on a population-wide basis rather than on an individual basis. Because despite what I said above the first virtue of individual programmers is laziness! We're perhaps only "moral" at the population level.

Valerio Schiavoni

2:56 p.m.

Hello everyone, it seems the discussion is sparking an interesting debate, thanks to everyone.

To put things back in context, we use Wikipedia as one of the few websites where users can access different 'versions' of the same page. Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen. New versions of a page are created as well. Finally, users might potentially need to explore several old versions of a given web page, for example by accessing the details of its history [1]. Access traces need to be accurate to model the workload on the servers that store the contents served by the web servers. A resolution coarser than 1 second would not reflect the access patterns on Wikipedia or similarly versioned web sites. We use these access patterns to test different version-aware storage techniques. For those interested, I could send the pre-print version of an article that I will present next month at the IEEE SRDS'14 conference.

As for potential privacy concerns about disclosing such traces, I would like to stress that we are not looking into 'who' requested a given URL or from 'where'. That information is completely absent from the Wikibench traces, and can/should remain absent in new traces. Let's say Wikipedia somehow reveals the top-10 most-visited pages in the last minute: would that represent a privacy breach for some users? I doubt it, and I invite the audience to convince me of the contrary.

Best regards, Valerio

1 - For example: http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history

On Fri, Sep 19, 2014 at 8:36 AM, Pine W <wiki.pine(a)gmail.com> wrote:

...

Pine W

20 Sep

10:45 a.m.

...

Pine W

11:02 a.m.

A few more thoughts:

* You probably don't need the full URLs of the content being accessed, so those could be anonymized and replaced with random identifiers to some degree, right?
* Someone might be able to monitor the user's end of the transactions, such as by having university network logs that show destination domains and timestamps, in such a way that they could pair the university logs with Wikimedia access traces of one second granularity and thus defeat some measures of privacy for the university's Wikimedia users, correct?
* I am not sure that the staff time required to analyze this request and produce the data is a good use of resources on Wikimedia's end. Toby would be a good person to ask about this.

Pine

On Sep 20, 2014 12:45 AM, "Pine W" <wiki.pine(a)gmail.com> wrote:

...

Thanks for the explanation. On moderate to high traffic pages, let's say with a minimum of 10 hits per minute across the entire time span studied, perhaps the requested data could be provided while still providing strong privacy protection. Toby might need to discuss this with WMF Legal. Pine
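Two of Pine's suggestions (releasing only pages with at least 10 hits in every minute of the time span, and replacing URLs with random identifiers) can be sketched as follows. This is a minimal illustration under stated assumptions, not a vetted anonymization scheme: the function names, the (timestamp, page) input layout, and the choice of `secrets.token_hex` for identifiers are all assumptions made for the example.

```python
import math
import secrets
from collections import Counter


def pages_meeting_threshold(requests, start, end, min_hits=10):
    """Return pages with at least min_hits requests in every minute of [start, end)."""
    per_minute = Counter()
    pages = set()
    for ts, page in requests:
        if start <= ts < end:
            per_minute[(page, int(ts - start) // 60)] += 1
            pages.add(page)
    n_minutes = math.ceil((end - start) / 60)
    return {
        p for p in pages
        if all(per_minute[(p, m)] >= min_hits for m in range(n_minutes))
    }


def pseudonymize(requests, allowed_pages):
    """Keep only requests to allowed pages and swap each page for a random id."""
    ids = {}
    released = []
    for ts, page in requests:
        if page in allowed_pages:
            if page not in ids:
                ids[page] = secrets.token_hex(8)  # random, not derived from the page
            released.append((ts, ids[page]))
    return released


# Hypothetical two-minute trace with a lowered threshold for brevity.
trace = [(1, "A"), (2, "A"), (61, "A"), (62, "A"), (1, "B"), (2, "B"), (3, "B")]
busy = pages_meeting_threshold(trace, start=0, end=120, min_hits=2)
print(pseudonymize(trace, busy))
```

Note that the timestamps are released unchanged, so the linkage risk Pine describes (pairing a one-second-granularity trace with network logs collected elsewhere) is not addressed by this sketch.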

Oliver Keyes

8:08 p.m.

Given that the request logs aren't transparent about which cached version of a page is being provided I'm finding it pretty difficult to see how they'd help you answer interesting questions here :/. On 20 September 2014 04:02, Pine W <wiki.pine(a)gmail.com> wrote:

...

Thanks for the explanation. On moderate to high traffic pages, let's say with a minimum of 10 hits per minute across the entire time span studied, perhaps the requested data could be provided while still providing strong privacy protection. Toby might need to discuss this with WMF Legal.

Pine

On Sep 19, 2014 4:57 AM, "Valerio Schiavoni" <valerio.schiavoni(a)gmail.com> wrote:

> Hello everyone,
> it seems the discussion is sparking an interesting debate, thanks to everyone.
>
> To put things back in context, we use Wikipedia as one of the few websites where users can access different 'versions' of the same page. Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen. New versions of a page are created as well. Finally, users might potentially need to explore several old versions of a given web page, for example by accessing the details of its history[1].
>
> Access traces need to be accurate to model the workload on the servers that are storing the contents being served by the web servers. A resolution coarser than 1 second would not reflect the access patterns on Wikipedia, or similarly versioned, web sites. We use these access patterns to test different version-aware storage techniques. For those interested, I could send the pre-print version of an article that I will present next month at the IEEE SRDS'14 conference.
>
> Regarding potential privacy concerns about disclosing such traces, I would like to stress that we are not looking into 'who' or from 'where' a given URL was requested. That information is completely absent from the Wikibench traces, and can/should remain so in new traces.
>
> Let's say Wikipedia somehow reveals the top-10 most-visited pages in the last minute: would that represent a privacy breach for some users? I very much doubt it, and I invite the audience to convince me of the contrary.
>
> Best regards,
> Valerio
>
> 1- For example: http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history
>
> On Fri, Sep 19, 2014 at 8:36 AM, Pine W <wiki.pine(a)gmail.com> wrote:
>
>> Let's loop back to the request at hand. Valerio, can you describe your use case for access traces at intervals shorter than one hour? The very likely outcome of this discussion is that the access traces at shorter intervals will not be made available, but I'm curious about what you would do with the data if you had it.
>>
>> Pine
>>
>> On Sep 18, 2014 4:55 PM, "Richard Jensen" <Rjensen(a)uic.edu> wrote:
>>
>>> the basic issue in sampling is to decide what the target population T actually is. Then you weight the sample so that each person in the target population has an equal chance w and people not in it have weight zero.
>>>
>>> So what is the target population we want to study?
>>> --the world's population?
>>> --the world's educated population?
>>> --everyone with internet access
>>> --everyone who ever uses Wikipedia
>>> --everyone who uses it a lot
>>> --everyone who has knowledge to contribute in positive fashion?
>>> --everyone who has the internet, skills and potential to contribute?
>>> --everyone who has the potential to contribute but does not do so?
>>>
>>> Richard Jensen
>>> rjensen(a)uic.edu

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- Oliver Keyes Research Analyst Wikimedia Foundation

Samuel Klein

22 Sep 2014

12:59 a.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Both the desire for highly granular data and the concerns about privacy seem somewhat caricatured in this conversation :) Valerio writes:

...

Access traces need to be accurate to model the workload on the servers that are storing the contents being served by the web servers. A resolution coarser than 1 second would not reflect the access patterns on Wikipedia, or similarly versioned, web sites.

I don't understand your last sentence. Why can't you do the analysis you describe with hour-resolution data? It might help this discussion if you did a sample analysis for one page & one day, with available data, and indicated where higher res would help. Pine writes:

...

Someone might be able to monitor the user's end of the transactions, such as by having university network logs that show destination domains and timestamps, in such a way that they could pair the university logs with Wikimedia access traces of one second granularity and thus defeat some measures of privacy for the university's Wikimedia users, correct?

en.wp gets 2000+ pageviews/s, so not much privacy is lost in that scenario, which is already pretty narrow: if you have access to the university logs, you might have access to the full destination url. I'm having a hard time seeing how high-res data (full urls, no source) would be a privacy risk – but if needed, binning could likely be done closer to the second than to the hour. Warmly, Sam On Fri, Sep 19, 2014 at 7:56 AM, Valerio Schiavoni <valerio.schiavoni(a)gmail.com> wrote:

...


-- Samuel Klein @metasj w:user:sj +1 617 529 4266
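As an illustration of the binning Sam mentions, here is a minimal Python sketch. It assumes each trace record is just a Unix timestamp plus a URL, with no source address; the function name `bin_requests` and the sample records are hypothetical and not taken from any WMF tooling. It only shows that the same trace can be aggregated at one-second, one-minute, or one-hour resolution by changing a single parameter.

```python
from collections import Counter


def bin_requests(records, bin_seconds):
    """Count requests per (bin start, URL).

    records: iterable of (unix_timestamp, url) pairs; no requester data is kept.
    bin_seconds: bin width in seconds (1 for per-second, 3600 for hourly).
    """
    counts = Counter()
    for timestamp, url in records:
        # Floor each timestamp to the start of the bin that contains it.
        bin_start = int(timestamp // bin_seconds) * bin_seconds
        counts[(bin_start, url)] += 1
    return counts


# Hypothetical records, for illustration only.
sample = [
    (1000.2, "/wiki/A"),
    (1000.9, "/wiki/A"),
    (1003.5, "/wiki/B"),
    (1061.0, "/wiki/A"),
]
per_second = bin_requests(sample, 1)
per_minute = bin_requests(sample, 60)
```

Coarser bins merge more requests into each count, which is the trade-off between workload fidelity and privacy being debated in this thread.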

Pine W

8:13 a.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hm, on the second point the person to ask is Toby, but it sounds like there are reasons for the minimum one-hour granularity, and with Oliver's point it sounds like this research approach won't produce the intended benefits anyway. Perhaps another reason for the one-hour minimum granularity is that the storage and other resource requirements for highly granular data are too expensive to justify the benefits. Pine

Ahmed Aley

9:38 a.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hi, IR-Cache provide their traces on less than a second granularity. They have been doing that for years. They deal with the storage problem by keeping a rotating log of at most one week: when they add a new file for today, they delete the one for Monday last week. Anyone who needs more than one week of data has to write their own script or download the files at least once a week. Should Wikimedia provide such data, there shouldn't be a storage problem. Best, Ahmed On Mon, Sep 22, 2014 at 7:13 AM, Pine W <wiki.pine(a)gmail.com> wrote:

...
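The one-week rotation Ahmed describes can be sketched roughly as follows. This is only an illustration in Python: the function name `rotate`, the `keep_days` parameter, and the file names are hypothetical, and nothing here reflects how IR-Cache actually implements its rotation.

```python
from datetime import date, timedelta


def rotate(files_by_day, today, keep_days=7):
    """Return only the trace files that survive rotation.

    files_by_day: dict mapping a date to that day's trace file name.
    A file is kept if it is at most keep_days - 1 days older than today,
    so adding Monday's file drops the file from the previous Monday.
    """
    cutoff = today - timedelta(days=keep_days)
    return {day: name for day, name in files_by_day.items() if day > cutoff}


# Hypothetical file names: eight daily files, 15 to 22 September 2014.
today = date(2014, 9, 22)  # a Monday
files = {
    today - timedelta(days=n): f"trace-{today - timedelta(days=n)}.log"
    for n in range(8)
}
kept = rotate(files, today)
```

With this policy storage stays bounded at seven daily files, and anyone who wants a longer series has to fetch the files at least once a week, as noted above.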

Valerio Schiavoni

1:45 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hello, On Mon, Sep 22, 2014 at 8:38 AM, Ahmed Aley <ahmeda(a)cs.umu.se> wrote:

...

IR-Cache provide their traces on less than a second granularity.

What and where is this IR-Cache? A quick Google search did not help...

Ahmed Aley

1:59 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hi, their website has been down for a few days. Here is a cached version: http://webcache.googleusercontent.com/search?q=cache:apD3FN7QLxgJ:www.ircac… ++Ahmed On Mon, Sep 22, 2014 at 12:45 PM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

Hello, On Mon, Sep 22, 2014 at 8:38 AM, Ahmed Aley <ahmeda(a)cs.umu.se> wrote:

IR-Cache provide their traces on less than a second granularity.

What and where is this IR-Cache? A quick Google search did not help...

Benj. Mako Hill

24 Sep 2014

9:13 a.m.

New subject: [Wikimedia-l] wikipedia access traces ?

...

Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen.

At least as far as I know, views to historical versions of webpages in Wikipedia don't show up in the access logs at all because certain kinds of requests (like requests to /w/index.php?oldid=NUMBER) don't get recorded in the pageview data.

...

New versions of a page are created as well. Finally, users might potentially need to explore several old versions of a given web page, for example by accessing the details of its history[1].

AFAIK, viewing the history page itself is also not recorded in the page view data either. Regards, Mako -- Benjamin Mako Hill http://mako.cc/ Creativity can be a social contribution, but only in so far as society is free to use the results. --GNU Manifesto

Valerio Schiavoni

1:09 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hello Mako, On Wed, Sep 24, 2014 at 8:13 AM, Benj. Mako Hill <mako(a)atdot.cc> wrote:

...

Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen.

I'm sorry to contradict you, but at least in the Wikibench traces, that information is clearly present. I see things like:

1609418296 1190438479.078 http://en.wikipedia.org/w/index.php?title=Western_betrayal&oldid=982812…

That is, back in 2007, users were accessing a version of that page that dated back to 2005 or so.
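The trace line quoted above appears to carry a counter, a Unix timestamp with millisecond precision, and the requested URL. The following is a minimal Python sketch of how such a line might be classified, assuming that whitespace-separated layout. The field meanings are inferred from the single example in this message, the function name `parse_trace_line` is hypothetical, and the `title` and `oldid` values in the sample are made up because the original URL is truncated.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def parse_trace_line(line):
    """Split a Wikibench-style line into counter, time, title, and request kind."""
    # Assumed layout: <counter> <unix timestamp> <url> [further fields ignored]
    counter, timestamp, url = line.split()[:3]
    query = parse_qs(urlsplit(url).query)
    if "oldid" in query:
        kind = "old_revision"  # read of a specific past version
    elif query.get("action") == ["history"]:
        kind = "history"  # view of the page's revision history
    else:
        kind = "other"
    return {
        "counter": int(counter),
        "time": datetime.fromtimestamp(float(timestamp), tz=timezone.utc),
        "title": query.get("title", [None])[0],
        "kind": kind,
    }


# Hypothetical URLs; only the counter and timestamp come from the message above.
old = parse_trace_line(
    "1609418296 1190438479.078 "
    "http://en.wikipedia.org/w/index.php?title=Example&oldid=12345"
)
hist = parse_trace_line(
    "1609418297 1190438479.500 "
    "http://en.wikipedia.org/w/index.php?title=Example&action=history"
)
```

Counting the `old_revision` and `history` records over time is the kind of version-aware workload measurement that needs per-request traces rather than hourly page counts.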

...

New versions of a page are created as well. Finally, users might potentially need to explore several old versions of a given web page, for example by accessing the details of its history[1].

AFAIK, viewing the history page itself is also not recorded in the page view data either.

Scott Hale

2:59 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hi Valerio, Mako was referring to https://dumps.wikimedia.org/other/pagecounts-raw/ and the current logging practices. My understanding is also that these things are not logged on a routine basis. The Wikibench traces seem to have been a special case.

...

I've also contacted the researchers who partially released it, but making it publicly available is tricky for them, due to its size (12 TB), which might instead be somehow in the norms of the operations taken daily by Wikipedia servers.

Have the researchers looked into requester-pays data storage on Amazon or another provider? They should be able to make it public with no resources and at no cost to themselves whatever the size. Cheers, Scott On Wed, Sep 24, 2014 at 7:09 PM, Valerio Schiavoni < valerio.schiavoni(a)gmail.com> wrote:

...

Hello Mako, On Wed, Sep 24, 2014 at 8:13 AM, Benj. Mako Hill <mako(a)atdot.cc> wrote:

Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen.

New versions of a page are created as well. Finally, users might potentially need to explore several old versions of a given web page, for example by accessing the details of its history[1].

AFAIK, viewing the history page itself is also not recorded in the page view data either.

Sorry to contradict you again, but there are indeed logs for that as well: http://en.wikipedia.org/w/index.php?title=Marina_Nadiradze&action=histo… I'm quite surprised that such information is not known by the community of Wikipedia researchers. Best, Valerio

-- Scott Hale Oxford Internet Institute University of Oxford http://www.scotthale.net/ scott.hale(a)oii.ox.ac.uk

Benj. Mako Hill

7:08 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

...

...

I'm quite surprised that such information is not known by the community of Wikipedia researchers.

Well, my ignorance is my own and does not reflect the community of Wikipedia researchers. :) But, as Scott pointed out, I was referring to pagecount data published by WMF (i.e., the data binned by hour that we were discussing in the sub-thread). I was replying to the discussion about the granularity of the pagecount data to point out that increased granularity won't help you because the data you want isn't provided in /that/ dataset at all. Wikibench is the only source of data I know of that includes hits to the "/w/index.php" pages for all of Wikipedia (I'd love to hear that I'm wrong about that). Unfortunately, Wikibench was, as far as I know, basically a one-off thing. It's great if you want a 10% sample of this kind of data for a ~3.5 months period in late 2007. If you want anything that is less stale, I think you're going to have to try to cut a deal with WMF to collect it. Regards, Mako -- Benjamin Mako Hill http://mako.cc/ Creativity can be a social contribution, but only in so far as society is free to use the results. --GNU Manifesto
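To make the limitation concrete, here is a small Python sketch of reading one line of an hourly pagecounts-raw file. It assumes the commonly documented layout of four space-separated fields per line (project code, page title, request count for that hour, bytes transferred); that layout is not stated in this thread, and the function name and sample values are hypothetical.

```python
def parse_pagecounts_line(line):
    """Parse one line of an hourly pagecounts file (assumed four-field layout)."""
    project, title, count, size = line.split()
    return {
        "project": project,  # e.g. "en" for English Wikipedia
        "title": title,      # page title with underscores instead of spaces
        "views": int(count),  # requests for this title during the hour
        "bytes": int(size),   # total bytes transferred for those requests
    }


# Hypothetical values, for illustration only.
row = parse_pagecounts_line("en Main_Page 42 123456")
```

Under this assumed layout a record has no per-request timestamp and no query string, so there is nowhere for an `oldid` or `action=history` request to appear, whatever the bin width.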

Valerio Schiavoni

7:14 p.m.

New subject: [Wikimedia-l] wikipedia access traces ?

Hi On Wed, Sep 24, 2014 at 6:08 PM, Benj. Mako Hill <mako(a)atdot.cc> wrote:

...

I'm quite surprised that such information is not known by the community of Wikipedia researchers.

Well, my ignorance is my own and does not reflect the community of Wikipedia researchers. :)

I did not mean to offend anyone with this sentence. Please, community of Wikipedia researchers, accept my apologies.

...

If you want anything that is less stale, I think you're going to have to try to cut a deal with WMF to collect it.

If the WMF is open to discussing this, I'll be more than ready to discuss possible agreements. Could anyone please point me to the right people to contact about this possibility? Thanks, Valerio

Michael Maggs

28 Oct 2014

7:46 p.m.

New subject: 'Wikipedia Network Analysis' by Brian Keegan

(Apologies if this has been referred to already on this list. If so, I missed it). A couple of weeks ago, Brian Keegan published a very nice blog post [1] on the use of Python for Wikimedia research. He uses examples from the English Wikipedia but the techniques he describes are applicable more generally. It’s fascinating, and shows what a lot can be done with a few lines of code. Michael [1] http://nbviewer.ipython.org/github/brianckeegan/Wikipedia-Network-Analysis/…

Maximilian Klein

30 Oct 2014

7:21 p.m.

New subject: 'Wikipedia Network Analysis' by Brian Keegan

IPython notebook FTW. Thanks for sharing. Make a great day, Max Klein ‽ http://notconfusing.com/ On Tue, Oct 28, 2014 at 10:46 AM, Michael Maggs <Michael(a)maggs.name> wrote:

...

Dario Taraborelli

31 Oct 2014

6:04 a.m.

New subject: 'Wikipedia Network Analysis' by Brian Keegan

(shameless plug) for those of you who are on Twitter, follow @WikiResearch <https://twitter.com/WikiResearch> and you won’t miss any of these announcements. Dario

...

On Oct 30, 2014, at 10:21 AM, Maximilian Klein <isalix(a)gmail.com> wrote: IPython notebook FTW. Thanks for sharing. Make a great day, Max Klein ‽ http://notconfusing.com/ On Tue, Oct 28, 2014 at 10:46 AM, Michael Maggs <Michael(a)maggs.name> wrote: (Apologies if this has been referred to already on this list. If so, I missed it). A couple of weeks ago, Brian Keegan published a very nice blog post [1] on the use of Python for Wikimedia research. He uses examples from the English Wikipedia but the techniques he describes are applicable more generally. It’s fascinating, and shows what a lot can be done with a few lines of code. Michael [1] http://nbviewer.ipython.org/github/brianckeegan/Wikipedia-Network-Analysis/blob/master/Wikipedia%20Network%20Analysis.ipynb

Divya S

8:32 a.m.

New subject: 'Wikipedia Network Analysis' by Brian Keegan

Folks, Sorry for the interruption but I need to unsubscribe from this group. Can someone please help? Many thanks, Divya On Fri, Oct 31, 2014 at 9:34 AM, Dario Taraborelli < dtaraborelli(a)wikimedia.org> wrote:

...

(shameless plug) for those of you who are on Twitter, follow @WikiResearch <https://twitter.com/WikiResearch> and you won’t miss any of these announcements. Dario On Oct 30, 2014, at 10:21 AM, Maximilian Klein <isalix(a)gmail.com> wrote: IPython notebook FTW. Thanks for sharing. Make a great day, Max Klein ‽ http://notconfusing.com/ On Tue, Oct 28, 2014 at 10:46 AM, Michael Maggs <Michael(a)maggs.name> wrote:


-- Regards, Divya S. <http://about.me/sdivya>


participants (17)

Aaron Halfaker
Aaron Halfaker
Ahmed Aley
Benj. Mako Hill
Dario Taraborelli
Divya S
Giovanni Luca Ciampaglia
Joe Corneli
Jonathan Morgan
Maximilian Klein
Michael Maggs
Oliver Keyes
Pine W
Richard Jensen
Samuel Klein
Scott Hale
Valerio Schiavoni