Re: [Wiki-research-l] [Wikimedia-l] wikipedia access traces ?

21 Sep 2014


Both the desire for highly granular data and the concerns about
privacy seem somewhat caricatured in this conversation :)
Valerio writes:
...
Access traces need to be accurate to model the workload on the servers
that store the content being served by the web servers.
A resolution coarser than 1 second would not reflect the access patterns on
Wikipedia or similarly versioned web sites.
I don't understand your last sentence.  Why can't you do the analysis
you describe with hour-resolution data?  It might help this discussion
if you did a sample analysis for one page & one day, with available
data, and indicated where higher res would help.
Pine writes:
...
Someone might be able to monitor the user's end of the transactions, such
as by having university network logs that show destination domains and
timestamps, in such a way that they could pair the university logs with
Wikimedia access traces of one second granularity and thus defeat some
measures of privacy for the university's Wikimedia users, correct?
en.wp gets 2000+ pageviews/s, so not much privacy is lost in that
scenario, which is already pretty narrow: if you have access to the
university logs, you might have access to the full destination url.
I'm having a hard time seeing how high-res data (full urls, no source)
would be a privacy risk – but if needed, binning could likely be done
closer to the second than to the hour.
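
To make the binning idea concrete, here is a minimal sketch, not a description of how Wikimedia or Wikibench actually process traces. It assumes each trace record is just a (Unix timestamp, URL) pair with no source information; the function name `bin_trace` and the sample records are hypothetical.

```python
from collections import Counter


def bin_trace(records, bin_seconds):
    """Count requests per (bin start, URL).

    records: iterable of (unix_timestamp, url) pairs (assumed format,
             with no client or source information).
    bin_seconds: bin width in seconds, e.g. 1, 10, or 3600.
    """
    counts = Counter()
    for ts, url in records:
        # Round the timestamp down to the start of its bin.
        bin_start = int(ts // bin_seconds) * bin_seconds
        counts[(bin_start, url)] += 1
    return counts


# Hypothetical sample records, for illustration only.
trace = [
    (1411113600.2, "/wiki/Main_Page"),
    (1411113600.9, "/wiki/Main_Page"),
    (1411113604.5, "/w/index.php?title=George_W._Bush&action=history"),
    (1411113611.0, "/wiki/Main_Page"),
]

per_10s = bin_trace(trace, 10)      # bins close to the second
per_hour = bin_trace(trace, 3600)   # hour-resolution bins
```

Running the same records through several bin widths is one way to check how much of the access pattern survives at coarser resolution.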
Warmly, Sam
On Fri, Sep 19, 2014 at 7:56 AM, Valerio Schiavoni
valerio.schiavoni@gmail.com wrote:
...
Hello everyone,
it seems the discussion is sparking an interesting debate, thanks to
everyone.
To put things back in context, we use Wikipedia as one of the few websites
where users can access different 'versions' of the same page.
Users mostly read the most recent version of a given page, but from time to
time, read accesses to the 'history' of a page happen.
New versions of a page are created as well. Finally, users might potentially
need to explore several old versions of a given web page, for example by
accessing the details of its history[1].
Access traces need to be accurate to model the workload on the servers that
store the content being served by the web servers.
A resolution coarser than 1 second would not reflect the access patterns on
Wikipedia or similarly versioned web sites.
We use these access patterns to test different version-aware storage
techniques.
For those interested, I could send the pre-print version of an article that
I will present next month at the IEEE SRDS'14 conference.
As for potential privacy concerns about disclosing such traces, I
would like to stress that we are not looking into 'who' requested a given
URL or from 'where'. That information is completely absent from the
Wikibench traces, and can/should remain so in new traces.
Let's say Wikipedia somehow reveals the top-10 most-visited pages in the
last minute: would that represent a privacy breach for some users? I highly
doubt it, and I invite the audience to convince me of the contrary.
Best regards,
Valerio
[1] For example:
http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history
On Fri, Sep 19, 2014 at 8:36 AM, Pine W wiki.pine@gmail.com wrote:
...
Let's loop back to the request at hand. Valerio, can you describe your use
case for access traces at intervals shorter than one hour? The very likely
outcome of this discussion is that the access traces at shorter intervals
will not be made available, but I'm curious about what you would do with the
data if you had it.
Pine
On Sep 18, 2014 4:55 PM, "Richard Jensen" Rjensen@uic.edu wrote:
...
The basic issue in sampling is to decide what the target population T
actually is. Then you weight the sample so that each person in the target
population has an equal weight w and people not in it have weight zero.
So what is the target population we want to study?
--the world's population?
--the world's educated population?
--everyone with internet access?
--everyone who ever uses Wikipedia?
--everyone who uses it a lot?
--everyone who has knowledge to contribute in positive fashion?
--everyone who has the internet, skills and potential to contribute?
--everyone who has the potential to contribute but does not do so?
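
The weighting rule described above can be sketched as follows. This is only an illustration under simple assumptions: the sample is a list of records, membership in the target population T is decided by a predicate, and the field names `uses_wikipedia` and `edits` are hypothetical.

```python
def equal_weights(sample, in_target):
    """Give every sampled person in the target population the same
    weight w, and everyone outside it weight zero."""
    n_members = sum(1 for person in sample if in_target(person))
    w = 1.0 / n_members if n_members else 0.0
    return [w if in_target(person) else 0.0 for person in sample]


def weighted_mean(values, weights):
    """Weighted average of values; fails if all weights are zero."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no sampled person belongs to the target population")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical sample, for illustration only.
sample = [
    {"uses_wikipedia": True, "edits": 4},
    {"uses_wikipedia": False, "edits": 0},
    {"uses_wikipedia": True, "edits": 0},
    {"uses_wikipedia": True, "edits": 2},
]

# Target population T assumed here: "everyone who ever uses Wikipedia".
weights = equal_weights(sample, lambda p: p["uses_wikipedia"])
mean_edits = weighted_mean([p["edits"] for p in sample], weights)
```

Choosing a different target population from the list changes only the predicate, which is why the choice of T has to come first.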
Richard Jensen
rjensen@uic.edu

Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- 
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266
