Re: [Wiki-research-l] [Wikimedia-l] wikipedia access traces ?

20 Sep 2014

Given that the request logs aren't transparent about which cached version
of a page is being provided, I'm finding it pretty difficult to see how
they'd help you answer interesting questions here :/.

On 20 September 2014 04:02, Pine W <wiki.pine(a)gmail.com> wrote:

...
  A few more thoughts:

 * You probably don't need the full URLs of the content being accessed, so
 those could be anonymized and replaced with random identifiers to some
 degree, right?

 * Someone might be able to monitor the user's end of the transactions,
 such as by having university network logs that show destination domains and
 timestamps, in such a way that they could pair the university logs with
 Wikimedia access traces of one second granularity and thus defeat some
 measures of privacy for the university's Wikimedia users, correct?

 * I am not sure that the staff time required to analyze this request and
 produce the data is a good use of resources on Wikimedia's end. Toby would
 be a good person to ask about this.

 Pine
 On Sep 20, 2014 12:45 AM, "Pine W" <wiki.pine(a)gmail.com> wrote:

 Thanks for the explanation. On moderate- to high-traffic pages, let's say
 with a minimum of 10 hits per minute across the entire time span studied,
 perhaps the requested data could be provided while still providing strong
 privacy protection. Toby might need to discuss this with WMF Legal.

 Pine
 On Sep 19, 2014 4:57 AM, "Valerio Schiavoni" <valerio.schiavoni(a)gmail.com> wrote:

> Hello everyone,
> it seems the discussion is sparking an interesting debate, thanks to
> everyone.
>
> To put things back in context, we use Wikipedia as one of the few
> websites where users can access different 'versions' of the same page.
> Users mostly read the most recent version of a given page, but from time
> to time, read accesses to the 'history' of a page happen.
> New versions of a page are created as well. Finally, users might
> potentially need to explore several old versions of a given web page, for
> example by accessing the details of its history[1].
> Access traces need to be accurate to model the workload on the servers
> that store the content being served by the web servers.
> A resolution coarser than 1 second would not reflect the access patterns
> on Wikipedia or similarly versioned web sites.
> We use these access patterns to test different version-aware storage
> techniques.
> For those interested, I could send the pre-print version of an article
> that I will present next month at the IEEE SRDS'14 conference.
>
> As for potential privacy concerns about disclosing such
> traces, I would like to stress that we are not looking into 'who' or from
> 'where' a given URL was requested. That information is completely absent
> from the Wikibench traces, and can/should remain so in new traces.
>
> Let's say Wikipedia somehow reveals the top-10 most-visited pages in the
> last minute: would that represent a privacy breach for some users? I highly
> doubt it, and I invite the audience to convince me otherwise.
>
> Best regards,
> Valerio
>
> 1- For example:
> http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history
>
> On Fri, Sep 19, 2014 at 8:36 AM, Pine W <wiki.pine(a)gmail.com> wrote:
>
>> Let's loop back to the request at hand. Valerio, can you describe your
>> use case for access traces at intervals shorter than one hour? The very
>> likely outcome of this discussion is that the access traces at shorter
>> intervals will not be made available, but I'm curious about what you would
>> do with the data if you had it.
>>
>> Pine
>> On Sep 18, 2014 4:55 PM, "Richard Jensen" <Rjensen(a)uic.edu> wrote:
>>
>>> The basic issue in sampling is to decide what the target population T
>>> actually is. Then you weight the sample so that each person in the target
>>> population has an equal chance w, and people not in it have weight zero.
>>>
>>> So what is the target population we want to study?
>>> --the world's population?
>>> --the world's educated population?
>>> --everyone with internet access?
>>> --everyone who ever uses Wikipedia?
>>> --everyone who uses it a lot?
>>> --everyone who has knowledge to contribute in positive fashion?
>>> --everyone who has the internet, skills and potential to contribute?
>>> --everyone who has the potential to contribute but does not do so?
>>>
>>> Richard Jensen
>>> rjensen(a)uic.edu
>>>
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
