[Foundation-l] Fwd: [Fwd: Re: Information for the Wikimedia user community]

Michael Bimmler mbimmler at gmail.com
Sat Sep 15 22:05:43 UTC 2007


---Forwarded on request of the sender, who is not subscribed to the list----




tstarling said:
> I have posted to our public mailing list "foundation-l" about your
> project and your request for private data. They have the following
> questions:

> * How many people would require access to the data?
> * What is your research goal? Is it technical or sociological study?
> * How long would you need to collect data for, and how long would you
> store it?

> You can reply to me, or you can read and reply to the thread itself:

Hi,

I've been reading the thread, and I'll try to address your general
concerns and your specific questions. Since I'm not a subscriber to
foundation-l, please CC me in your answers. [Sorry for the long post.
If you just want a summary and the answers to those specific questions,
go straight to the end.]

First of all, some background. We at GSyC/LibreSoft have been
researching the libre (free, open source) software development community
for years. We focus mainly on public data, such as CVS/SVN repositories
or mailing lists. With that (massive) information, we try to improve our
understanding of the development and maintenance of libre software.

In this area, we have occasionally used non-public data, when it was
not distributed publicly for privacy-related reasons. For instance, that
applies to some private data of SourceForge users (distributed to
academic users by the University of Notre Dame [1]), or to the actual
archives of mailing lists as kept by some projects (in many cases,
public versions of the archives do not include real email addresses
because of spam-related issues). In these cases, having access to that
specific information allowed us to research some aspects (such as the
geographical origin of developers and participants in libre software
communities) which would otherwise have been impossible.

About two years ago, we found that Wikipedia was an interesting case
from a research point of view, for many reasons. Felipe Ortega, one of
our PhD students, started to explore that direction by building the
WikiXRay tool [2], and using it to perform several studies with
Wikipedia dumps as source data.

Now, we're exploring a new line which has more to do with the system
that provides the Wikipedia service. The long-term goal is to understand
it, to profile it, and to find ways to improve it. From an academic
point of view, the Wikimedia system is real gold: one of the top
Internet sites, with almost all the information (content, architecture,
etc.) available for inspection. Both from a pedagogical and from a
research viewpoint, it is rather interesting.

When we (that's mainly Felipe Ortega, Antonio Reinoso, in CC, and
Gregorio Robles) started to consider the Wikimedia system from an
architectural and networking point of view, one of the first issues
raised was the need for access to reliable statistics about its
behavior. Antonio contacted Tim about that, and it seems that the
easiest data to provide was the sampled dumps of Squid logs that we're
now talking about.

This is all for context. Now, before going into the details of
privacy-related information, let me also say that we would like to work
with you to find the most appropriate way of getting as much
non-privacy-related information, suitable for research, as you may
consider reasonable. And, of course, to find ways of making it available
to the research community as a whole. Thanks to your ideals of
transparency and sharing of knowledge, and to the technical relevance of
the site, with time the Wikimedia system could become one of the
canonical case studies for the research community, and we would like to
help make that happen. For instance, by instrumenting (or maybe just
logging) the Wikimedia software in the proper way, we could profile
different kinds of requests (from the Squid front end to the database),
identify bottlenecks, measure delays and bandwidth at different steps in
the interactions, etc.

That said, we're of course ready to respect your policies. If for any
reason you prefer to only provide that data yourself, or to strip this
or that piece of information from it (for privacy or other reasons),
that's OK. What I would like is to identify the information you could
provide which is useful for research, while keeping you happy, not
harming performance, respecting your policies, etc.

With respect to private information, I fully understand your concerns,
and I'm also familiar with them because of our previous work with the
libre software community. In fact, after some years of experience, we've
found that in many cases the best thing is to work jointly with projects
to identify which information can be made available, and how, maybe
under different conditions, to specific research groups or to the public
at large. I would like to do the same with Wikimedia, if possible.

Now, your specific questions (I understand that they refer to
information that could be used with some ease to track individual
identities).

> * How many people would require access to the data?

As few as possible. To start with, just Antonio and me, and probably
other researchers in my group. However, maybe this could be used as a
test case to define conditions that could be offered in the future to
other research groups. To be honest, I wouldn't like to be the only
group with access to such data, since any study we performed on it
would not be reproducible by others, and therefore it could hardly be
called research...

> * What is your research goal? Is it technical or sociological study?

Both. In the specific case of IP addresses, I would like to use them
mainly for geotargeting, which would allow for several interesting
studies. For instance, on the "sociological" side, it would be nice to
know the share of different countries for certain language editions of
Wikipedia (both in edits and reads): consider the cases of English,
Spanish or Chinese, which surely present different patterns. But IP
addresses can also be used to understand how proxies are dealing with
requests from different mega-carriers, or to identify crawlers and
similar agents.
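As a minimal sketch of the kind of geotargeting study I mean (the
IP-to-country table below is purely hypothetical; a real study would
use a proper GeoIP database, and all IPs here are documentation-only
example addresses):

```python
from collections import Counter

# Hypothetical lookup table standing in for a real GeoIP database.
IP_TO_COUNTRY = {
    "203.0.113.7": "AU",
    "198.51.100.23": "US",
    "192.0.2.45": "ES",
}

def country_shares(ips):
    """Return the fraction of requests per country for a list of IPs."""
    counts = Counter(IP_TO_COUNTRY.get(ip, "unknown") for ip in ips)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}

# Three sampled requests: two from AU, one from ES.
shares = country_shares(["203.0.113.7", "192.0.2.45", "203.0.113.7"])
```

The same aggregation, broken down per language edition, would give the
per-country shares of reads and edits mentioned above.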

Of course, in some cases there are alternative ways of doing this kind
of research, but in most cases having IP addresses is the most direct,
or the most reliable, way.

For now, we're not interested in individual patterns, and that's why
1/1,000 and even smaller samples are more than enough, if they are
reasonably unbiased.
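For illustration, an unbiased 1-in-N sample of log lines can be as
simple as the following sketch (the function name and default rate are
just for this example, not a description of how the actual dumps are
produced):

```python
import random

def sample_lines(lines, rate=1000, seed=None):
    """Keep each line independently with probability 1/rate,
    giving a roughly unbiased 1-in-`rate` sample of the log."""
    rng = random.Random(seed)
    return [line for line in lines if rng.randrange(rate) == 0]
```

Sampling each line independently, rather than taking every 1000th line,
avoids bias from any periodic structure in the traffic.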

> * How long would you need to collect data for, and how long would you
> store it?

Ideally, we would like to do it continuously over time, since the
dynamic evolution is quite interesting. But we're of course ready to
accept time limits if needed.

In summary, we are very thankful to the Wikimedia community for
providing as much information as you do now. We hope to go on using it
to better understand Wikipedia, the Wikimedia systems, etc. But we would
also like to work with you to identify other sources of information
which are currently not provided, but maybe could be without harming
Wikimedia or its users, and which would be of great interest to the
research community. And we would like to do all this in a way that other
research groups may also benefit from the data.

As somebody said in the previous thread, most of this can be done
either:

(1) by providing the data to researchers, or

(2) by asking researchers to write scripts or the like that run at
Wikimedia facilities, producing output that would be sent to the
researchers without actually delivering the source data.
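To illustrate option (2), here is a minimal sketch of the kind of
script we mean: it would run on the Wikimedia side and return only
aggregate counts, so the raw IPs never leave your machines. The field
layout below is an assumption for the example, not the actual Squid log
format:

```python
from collections import Counter

def aggregate(log_lines):
    """Parse 'timestamp ip method url' records (an assumed layout)
    and emit only per-URL request counts; the IP field is discarded."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 4:
            _timestamp, _ip, _method, url = fields[:4]
            counts[url] += 1
    return dict(counts)

# Synthetic records using documentation-only example IPs.
logs = [
    "1189891543 203.0.113.7 GET http://en.wikipedia.org/wiki/Main_Page",
    "1189891544 192.0.2.45 GET http://es.wikipedia.org/wiki/Portada",
    "1189891545 203.0.113.7 GET http://en.wikipedia.org/wiki/Main_Page",
]
```

Only the dictionary of counts would be shipped back to the researchers.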

We would prefer (1), because it depends less on the resources that
Wikimedia may have for implementing (2), because (2) may not scale if
many groups start using the data, and because (2) makes review and
reproducibility of research more difficult. But if you prefer (2) (or
prefer it in some specific cases, such as the IP addresses of client
machines), let's see how we can implement it.

Again, sorry for the long message, and thanks for reading up to here.
I'd be happy to answer any comment you may have.

Saludos,

        Jesus.

[1] http://www.nd.edu/~oss/Data/data.html

[2] http://meta.wikimedia.org/wiki/WikiXRay



