Hi everybody!
How are you? I hope you are all well. I am Mayo Fuster Morell, and I am doing PhD research on Wikipedia governance at the European University Institute.
I would appreciate your help with three specific questions I have about Wikipedia data.
* Is there data or are there research results on the number of users per article? In particular, what is the most frequent number of users per article, or what is the distribution of the number of editors per article?
* Is there data on gender distribution? I know of the general survey (which says 12.83% of contributors are women) and Ortega's work. Is there any other data on gender distribution?
* Does the site learn from navigation and searches? That is, if a Wikipedia visitor reads the Network entry and then goes to the Manuel Castells entry, will the system understand that there is a connection between them? Will it put them together the next time it presents search results?
Any suggestions for answering these questions are very welcome! Looking forward to Wikimania! Mayo
«·´`·.(*·.¸(`·.¸ ¸.·´)¸.·*).·´`·» «·´¨*·¸¸« Mayo Fuster Morell ».¸.·*¨`·» «·´`·.(¸.·´(¸.·* *·.¸)`·.¸).·´`·»
Research Digital Commons Governance: http://www.onlinecreation.info
European University Institute - PhD Candidate
School of Information Berkeley - Visiting researcher
Phone Italy: 0039-3345440747 or 0039-0558409982
Phone Spanish State: 0034-648877748
E-mail: mayo.fuster@eui.eu
Skype: mayoneti
Identi.ca: Mayo
Postal address: Badia Fiesolana - Via dei Roccettini 9, I-50014 San Domenico di Fiesole (FI) - Italy
Fax [+39] 055 4685 201
Hi Mayo,
On Sun, Apr 11, 2010 at 9:06 AM, Fuster, Mayo Mayo.Fuster@eui.eu wrote:
- Is there data or are there research results on the number of users per article? In particular, what is the most frequent number of users per article, or what is the distribution of the number of editors per article?
I was about to say that this can be derived easily from an analysis of the revisions database table, but then I noticed that no dump of this table in isolation is available from download.wikimedia.org... You can, however, access all this data by using the Wikipedia API (http://www.mediawiki.org/wiki/API), but it requires some programming.
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Thank you Luca! Unfortunately, I do not have programming skills. I am looking for research results or data that are already available on these questions, so that I can cite them in my research writing. Thanks again. Have a nice day, Mayo
It seems that some of these questions need data not just on revisions, but also on web visitors and click-stream patterns. Does the foundation collect web access data or make it available?
--J
On Sun, Apr 11, 2010 at 11:45 PM, James Howison james@howison.name wrote:
Seems like some of these questions seem like they need data not just on revisions, but on web visitors, and click stream patterns. Does the foundation collect or make web access data available?
The mediawiki API described at http://en.wikipedia.org/w/api.php includes
- action=clicktracking: Track user clicks on JavaScript items.
- action=specialclicktracking: Returns data to the Special:ClickTracking visualization page
I tried to call them, but I think you need to have admin rights. If you want to try it, you should do it on a mediawiki of which you are an admin. If you try, I would be interested to know what these methods return.
P.
On 4/12/10 12:31 AM, paolo massa wrote:
The mediawiki API described at http://en.wikipedia.org/w/api.php includes
- action=clicktracking: Track user clicks on JavaScript items.
- action=specialclicktracking: Returns data to the Special:ClickTracking visualization page
For general aggregate data, we have http://stats.wikimedia.org/, which is based purely on our server logs.
The clicktracking API is currently very limited. It is mostly just a tool we've been using to see which kinds of users click which buttons and navigation elements, so that the usability initiative (usability.wikimedia.org) can rearrange icons and toolbars in a useful way.
On 4/11/10 2:14 PM, Luca de Alfaro wrote:
I was about to say that this can be derived easily from an analysis of the revisions database table, but then I noticed that no dump of this table in isolation is available from download.wikimedia.org... You can however access all this data by using the Wikipedia API (http://www.mediawiki.org/wiki/API) but it requires some programming.
Actually, if you go to:
http://download.wikipedia.org/{{wiki}}/latest/{{wiki}}-lates...
and replace {{wiki}} with the wiki you're interested in (for instance, enwiki for the English Wikipedia), you'll basically get the revision table without the actual text data, so it's a little bit smaller and more manageable. You'd have to do some analysis on it, but this could at least get you this information.
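As a rough illustration of the kind of analysis meant here, the sketch below assumes that (page title, contributor) pairs, one per revision, have already been extracted from such a dump. The extraction step is not shown, and the function names and toy data are hypothetical, not part of any Wikimedia tool.

```python
from collections import Counter, defaultdict


def editors_per_article(revisions):
    """Map each page title to its number of distinct editors.

    `revisions` is an iterable of (page_title, contributor) pairs, one per
    revision. This input shape is an assumption made for illustration.
    """
    editors = defaultdict(set)
    for page_title, contributor in revisions:
        editors[page_title].add(contributor)
    return {title: len(names) for title, names in editors.items()}


def editor_count_distribution(per_article):
    """Count how many articles have exactly k distinct editors, for each k."""
    return Counter(per_article.values())


# Toy rows standing in for data extracted from a dump (made up).
sample_revisions = [
    ("Network", "alice"),
    ("Network", "bob"),
    ("Network", "alice"),
    ("Manuel Castells", "carol"),
    ("Sociology", "alice"),
]

per_article = editors_per_article(sample_revisions)
distribution = editor_count_distribution(per_article)
# The most frequent number of editors per article, and how many articles have it.
most_common_count, n_articles = distribution.most_common(1)[0]
print(per_article)
print(dict(distribution))
print(most_common_count, n_articles)
```

On a real wiki the same two functions would answer both forms of the question: `distribution` is the distribution of editors per article, and `most_common_count` is its most frequent value.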
On Sun, Apr 11, 2010 at 12:06 PM, Fuster, Mayo Mayo.Fuster@eui.eu wrote:
- Does the site learn from navigation and searches? That is, if a Wikipedia visitor reads the Network entry and then goes to the Manuel Castells entry, will the system understand that there is a connection between them? Will it put them together the next time it presents search results?
No.
Although that is an interesting area of research.
Unfortunately, due to privacy concerns, the data that would be required to invent such a system (search strings and search click-through traces) is not available to the public. (And in fact, the traces aren't really collected currently, as far as I know.)
Thank you Gregory. I didn't mean to imply that the data should be public. My question was whether the Wikipedia system "learns" from the searches and the navigation of its visitors.
I am interested in this question because I am discussing whether, and why, merely using Wikipedia (without contributing) can be beneficial to it, along the lines of side effects: "contributing without intending to contribute".
Thank you again. Have a nice day! Mayo
Hello,
Gregory (if I remember correctly) mentioned this in August 2009: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1446862 All of the sites examined spy on their visitors, except Wikimedia and Wikipedia.
Kind regards Ziko
I guess that Wiki(pedia|media) could very well gather statistics on
(revision_id, clicked_link)
pairs without compromising the anonymity of the visitors. It would be very useful to have indications of which hyperlinks are most useful. For example, I am always curious whether the large editorial effort to curate categories is worth it. And also, if one had data on:
(revision_id, "search terms used in next search"),
one could infer which links are actually missing. The problem is that many people use search engines rather than Wikipedia's own search to navigate Wikipedia... but perhaps the information could still be reconstructed somehow from session information.
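To make the first statistic concrete, here is a minimal sketch of how (revision_id, clicked_link) pairs could be tallied into a per-revision share of clicks for each link. The input format, the function name and the sample values are all assumptions for illustration; no such log exists as far as this thread knows.

```python
from collections import Counter, defaultdict


def link_click_shares(click_pairs):
    """Return, for each revision, each link's share of that revision's clicks.

    `click_pairs` is an iterable of (revision_id, clicked_link) pairs that
    carry no visitor information (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for revision_id, clicked_link in click_pairs:
        counts[revision_id][clicked_link] += 1
    shares = {}
    for revision_id, link_counts in counts.items():
        total = sum(link_counts.values())
        shares[revision_id] = {
            link: n / total for link, n in link_counts.items()
        }
    return shares


# Made-up sample: four clicks from revision 101, one from revision 202.
sample_clicks = [
    (101, "Manuel Castells"),
    (101, "Manuel Castells"),
    (101, "Category:Sociology"),
    (101, "Manuel Castells"),
    (202, "Network"),
]

shares = link_click_shares(sample_clicks)
print(shares[101])
```

Comparing the share taken by category links with the share taken by in-text links would be one way to look at whether curating categories pays off.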
But as far as I know, there is neither a plan nor any current infrastructure for logging such data anonymously. I don't work there, however, so other better-informed people might comment.
Luca
On Sun, Apr 11, 2010 at 6:27 PM, Luca de Alfaro luca@dealfaro.org wrote:
I guess that Wiki(pedia|media) could very well gather statistics on
(revision_id, clicked_link)
pairs without compromising the anonymity of the visitors. It would be very useful to have indications of which hyperlinks are most useful. For example, I am always curious whether the large editorial effort to curate categories is worth it. And also, if one had data on:
(revision_id, "search terms used in next search"),
one could infer which links are actually missing.
It seems to me that this (especially the latter) would need to be done extremely carefully to avoid compromising the anonymity of the visitors. Although it's not quite as bad, it seems reminiscent of the AOL search data scandal (http://en.wikipedia.org/wiki/AOL_search_data_scandal), especially with regard to "search terms used in next search".
The first thing I proposed is innocuous (gathering stats on (revision_id, clicked_link)), and in fact can be done easily with a minimum of instrumentation.
The second is very different from the AOL search data. The AOL search data was problematic because it associated data on a per-user basis, so you could use some queries to figure out who the user was, and then see the other queries of the user. I am suggesting here to instead gather anonymous statistics on: (<was on page A>, did a search, <landed on page B>), keeping track only of the (A, B) pairs, without user information.
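A minimal sketch of this second idea, assuming the log already contains only (A, B) page pairs with no user information: count the pairs and report those that occur often enough and are not yet linked. The function name, the `existing_links` set and the `min_count` threshold are hypothetical choices of this sketch, not an existing Wikimedia facility.

```python
from collections import Counter


def missing_link_candidates(search_transitions, existing_links, min_count=2):
    """Suggest (A, B) pairs that visitors reach via search but that lack a link.

    `search_transitions` is an iterable of (page_a, page_b) pairs meaning
    "was on page A, did a search, landed on page B", with no user data.
    `existing_links` is a set of (page_a, page_b) pairs already linked.
    Pairs seen fewer than `min_count` times are dropped (a design choice
    of this sketch, to ignore one-off transitions).
    """
    pair_counts = Counter(search_transitions)
    return sorted(
        pair
        for pair, n in pair_counts.items()
        if n >= min_count and pair not in existing_links
    )


# Made-up sample data.
sample_transitions = [
    ("Network", "Manuel Castells"),
    ("Network", "Manuel Castells"),
    ("Network", "Manuel Castells"),
    ("Network", "Graph theory"),
    ("Network", "Graph theory"),
    ("Sociology", "Network"),
]
sample_links = {("Network", "Graph theory")}

candidates = missing_link_candidates(sample_transitions, sample_links)
print(candidates)
```

Because only aggregated (A, B) counts are kept, nothing in the output ties a transition to a particular visitor; whether the raw log can be collected that cleanly is a separate question.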
But the problem is that gathering such anonymous logs takes effort and is difficult to do securely; it is difficult to prevent someone from tampering with the logs and adding back information that should not have been there; and it is difficult to then present the information to Wikipedia editors in a way that helps them meaningfully improve pages.
So perhaps the first statistic is the only useful one.
I would also be curious to know, once a user enters, what percentage of subsequent page views come from the visitor clicking on links versus doing a search.
Luca
On Sun, Apr 11, 2010 at 6:19 PM, Ziko van Dijk zvandijk@googlemail.com wrote:
Hello,
Gregory (if I remember correctly) mentioned this in August 2009: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1446862 All of the sites examined spy on their visitors, except Wikimedia and Wikipedia.
It's possible, however, to track click progress without setting tracking cookies.
You can form an {IP address, User-Agent} tuple for every search and then assume that subsequent page loads with the same tuple come from the same client.
This is less accurate than full cookie-based tracking, but it should be sufficient for training a machine learning system for predictive search results, especially if we make the reasonable assumption that users at the same location are already more likely to be looking at similar materials.
You could also insert a tracking token in the search result HTTP GET, which would give you very accurate data, but only for the clicks directly off the search page.
Again, this is not the maximum amount of information, but it is likely sufficient, and it doesn't involve any deeply privacy-violating tracking.
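The {IP address, User-Agent} approach could look roughly like the sketch below. It assumes a request log of (timestamp in seconds, IP, User-Agent, kind, value) tuples, where kind is either "search" or "page"; that log format, the function name and the five-minute window are all assumptions for illustration.

```python
def search_click_pairs(requests, window_seconds=300):
    """Pair each search with the next page load from the same client.

    A "client" is approximated by the (ip, user_agent) tuple, as described
    above. Only (query, page) pairs are returned; the client key is dropped.
    """
    last_search = {}  # (ip, user_agent) -> (timestamp, query)
    pairs = []
    # Sorting the tuples orders the requests by timestamp first.
    for timestamp, ip, user_agent, kind, value in sorted(requests):
        client = (ip, user_agent)
        if kind == "search":
            last_search[client] = (timestamp, value)
        elif kind == "page" and client in last_search:
            searched_at, query = last_search.pop(client)
            # Ignore page loads that come too long after the search.
            if timestamp - searched_at <= window_seconds:
                pairs.append((query, value))
    return pairs


# Made-up log entries using documentation-range IP addresses.
sample_requests = [
    (0, "192.0.2.1", "UA-1", "search", "castells network"),
    (5, "192.0.2.2", "UA-1", "page", "Sociology"),
    (12, "192.0.2.1", "UA-1", "page", "Manuel Castells"),
    (20, "192.0.2.1", "UA-1", "page", "Network society"),
    (1000, "192.0.2.2", "UA-2", "search", "wiki"),
    (2000, "192.0.2.2", "UA-2", "page", "Wiki"),
]

pairs = search_click_pairs(sample_requests)
print(pairs)
```

Two clients behind the same address with the same User-Agent would be merged by this heuristic, which is the loss of accuracy relative to cookies mentioned above.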
The fact that only a few Wikimedia personnel are able to access information about browsing trails, and only a few community representatives can check the IPs of registered users, doesn't mean Wikimedia doesn't spy. It spies heavily on editing, and then offers some of the information back to the community. That research was focused only on Flash cookies, not on the general ability to get information about users' activities. If it didn't store any IP address => HTTP GET URL information, it would be making itself very open to DoS attacks, with no information to use to defend itself.
Maybe people should be educated in the ways they can defend against systematic privacy issues such as Flash cookies, single-pixel tracking cookies, etc. I defend against long-term profiling using the NoScript [1], BetterPrivacy [2] and Adblock Plus [3] add-ons for Firefox. Any Flash or DOM storage that I actually allow still gets wiped every time the browser restarts, along with all of the cookies (only session cookies are allowed in my browser settings). What a website learns within a single session of my browsing is their business as long as their Privacy Policy is accurate. If I didn't want it to be their business, I would use a VPN or another anonymising program to further restrict the amount of information they can usefully put together about me. The amount of information they get shouldn't be reliant on me. Even the User-Agent can be spoofed to a generic string to de-identify the masses who are using the same method of anonymisation on the same website (i.e., TorButton [4]).
If you are personally worried, there are many, many ways to protect yourself on the web. However, the majority of people won't ever realise that these exist, or see their importance, unless a website starts saying exactly what it knows about a person (especially a cross-domain entity like googlesyndication.com, which, incidentally, papers.ssrn.com would try to make me use if I didn't have Adblock Plus and NoScript enabled to protect against it).
Cheers,
Peter
[1] https://addons.mozilla.org/en-US/firefox/addon/722
[2] https://addons.mozilla.org/en-US/firefox/addon/6623
[3] https://addons.mozilla.org/en-US/firefox/addon/1865
[4] https://addons.mozilla.org/en-US/firefox/addon/2275