Hello everyone, it seems the discussion is sparking an interesting debate; thanks to everyone.
To put things back in context: we use Wikipedia because it is one of the few websites where users can access different 'versions' of the same page. Users mostly read the most recent version of a given page, but from time to time read accesses to the 'history' of a page happen. New versions of a page are created as well. Finally, users might need to explore several old versions of a given page, for example by accessing the details of its history [1]. Access traces need to be accurate to model the workload on the servers that store the content being served by the web servers. A resolution coarser than 1 second would not reflect the access patterns on Wikipedia or similarly versioned websites. We use these access patterns to test different version-aware storage techniques. For those interested, I could send the pre-print of an article that I will present next month at the IEEE SRDS'14 conference.
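To illustrate why the resolution matters, here is a minimal sketch in Python. The trace format (a list of timestamp/URL pairs), the example URLs and the function names are assumptions made for illustration only; this is not the Wikibench format. It shows how a short burst of 'history' reads is visible at 1-second resolution but disappears once the same trace is aggregated per hour.

```python
from collections import Counter


def requests_per_bucket(trace, resolution_s):
    """Count requests falling into each time bucket of `resolution_s` seconds."""
    counts = Counter()
    for timestamp, _url in trace:
        counts[int(timestamp // resolution_s)] += 1
    return counts


def peak_rate(trace, resolution_s):
    """Highest request rate (requests/second) observable at this resolution."""
    counts = requests_per_bucket(trace, resolution_s)
    return max(counts.values()) / resolution_s if counts else 0.0


# Hypothetical trace: (seconds since start, requested URL).
# 50 history reads within half a second, then one page read per minute.
trace = [(10.0 + i / 100.0, "/w/index.php?title=Example&action=history")
         for i in range(50)]
trace += [(60.0 * m, "/wiki/Example") for m in range(1, 60)]

print(peak_rate(trace, 1))     # the burst is visible at 1-second resolution
print(peak_rate(trace, 3600))  # the same trace averaged over one hour
```

At hourly resolution the storage layer would appear almost idle, even though it briefly had to serve dozens of requests for old versions in the same second.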
As for potential privacy concerns about disclosing such traces, I would like to stress that we are not looking into 'who' requested a given URL or from 'where'. That information is completely absent from the Wikibench traces, and can and should remain absent from new traces.
Let's say Wikipedia somehow reveals the top-10 most-visited pages in the last minute: would that represent a privacy breach for some users? I highly doubt it, and I invite the audience to convince me otherwise.
Best regards, Valerio
[1] For example: http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history
On Fri, Sep 19, 2014 at 8:36 AM, Pine W wiki.pine@gmail.com wrote:
Let's loop back to the request at hand. Valerio, can you describe your use case for access traces at intervals shorter than one hour? The very likely outcome of this discussion is that the access traces at shorter intervals will not be made available, but I'm curious about what you would do with the data if you had it.
Pine
On Sep 18, 2014 4:55 PM, "Richard Jensen" Rjensen@uic.edu wrote:
The basic issue in sampling is to decide what the target population T actually is. Then you weight the sample so that each person in the target population has an equal chance w, and people not in it have weight zero.
So what is the target population we want to study?
--the world's population?
--the world's educated population?
--everyone with internet access
--everyone who ever uses Wikipedia
--everyone who uses it a lot
--everyone who has knowledge to contribute in positive fashion?
--everyone who has the internet, skills and potential to contribute?
--everyone who has the potential to contribute but does not do so?
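As a rough illustration of the weighting step described above, here is a minimal sketch in Python. It assumes the sample was drawn uniformly and that membership in the target population T is already known for each sampled person; the data, variable names and function names are hypothetical.

```python
def target_weights(in_target):
    """Give equal weight w to each sampled person in T, and zero to everyone else."""
    n = sum(in_target)
    if n == 0:
        raise ValueError("no sampled person belongs to the target population")
    w = 1.0 / n
    return [w if flag else 0.0 for flag in in_target]


def weighted_mean(values, weights):
    """Weighted average; the weights are assumed to sum to 1."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical sample: edits per month, and whether each person belongs to T.
edits = [0, 2, 10, 40]
in_target = [False, True, True, True]

weights = target_weights(in_target)
print(weights)
print(weighted_mean(edits, weights))
```

Changing the definition of T changes which entries of `in_target` are true, and therefore changes the estimate, which is why the choice of target population has to come first.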
Richard Jensen rjensen@uic.edu
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l