I am doing a PhD on online civic participation projects
(e-participation). As part of my research, I carried out a user
survey in which I asked respondents whether they had ever edited or
created a page on a wiki. Now I would like to compare the results with
the overall rate of wiki editing/creation at the country level.
I've found some country-level statistics on Wikipedia Statistics (e.g.
3,000 editors of Wikipedia articles in Italy), but data for the UK and
France are not available, since Wikipedia provides statistics by
language, not by country. I'm therefore looking for statistics on the UK
and France (but am also interested in alternative ways of measuring wiki
editing/creation in Sweden and Italy).
I would be grateful for any tips!
Sunny regards, Alina
European University Institute
I'm starting a new project, a wiki search engine. It uses MediaWiki,
Semantic MediaWiki and other minor extensions, and some tricky templates.
I remember Wikia Search and how it failed. It had the mini-article thingy
for the introduction, and then a lot of links compiled by a crawler. Also
something similar to a social network.
My project idea (which still needs a cool name) is different. Although it
uses an introduction and images copied from Wikipedia, and some links from
the "External links" sections, that is only a start. The idea is that the
community adds, removes and orders the results for each term, and creates
redirects for similar terms to avoid duplicates.
Why this? I think that Google PageRank isn't enough. It is frequently
abused by link farms, SEOs and other people trying to put their websites
Search "Shakira" in Google for example. You see 1) Official site, 2)
Wikipedia 3) Twitter 4) Facebook, then some videos, some news, some images,
Myspace. It wastes 3 or more results on obvious sites (WP, TW, FB).
The wiki search engine puts these sites at the top, along with an
introduction and related terms, leaving all the space below for
not-so-obvious but interesting websites. Also, if you search for "semantic
queries" like "right-wing newspapers" in Google, you won't find real
newspapers but "people and sites discussing right-wing newspapers". Or
latex and LaTeX are shown in the same results pages. These issues can be
resolved with disambiguation result pages.
How do we choose which results go above or below? The rules are not fully
designed yet, but we can put official sites in first place, then .gov
or .edu domains, which are important ones, and later unofficial websites
and blogs, giving priority to the local language, etc. All of this would
be settled by reaching consensus.
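A rough sketch of how such ordering rules could look in code. The tier numbers, the result fields ("url", "lang") and the function names are only assumptions for illustration, since the actual rules are not designed yet:

```python
from urllib.parse import urlparse


def tier(result, official_sites):
    """Return a hypothetical priority tier; lower values sort first."""
    host = urlparse(result["url"]).hostname or ""
    if result["url"] in official_sites:
        return 0  # official site of the topic
    if host.endswith(".gov") or host.endswith(".edu"):
        return 1  # government and academic domains
    return 2  # unofficial websites, blogs, and everything else


def order_results(results, official_sites, local_lang):
    """Order results by tier, then by whether they are in the local language."""
    # sorted() is stable, so within the same tier and language the order
    # already chosen by the community is preserved.
    return sorted(
        results,
        key=lambda r: (tier(r, official_sites), r.get("lang") != local_lang),
    )
```

Community consensus would still have the last word; a function like this could only propose a default order for a new results page.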
We can control aggressive spam with spam blacklists, semi-protect or protect
highly visible pages, and use bots or tools to check changes.
It obviously has a CC BY-SA license and results can be exported. I think
that this approach is the opposite of Google today.
For weird queries like "Albert Einstein birthplace" we can redirect to the
most obvious results page (in this case Albert Einstein) using a hand-made
redirect or in software (a small change to MediaWiki).
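The software variant could be as simple as dropping trailing words until a known results page matches. This is only a sketch of one possible approach, not an existing MediaWiki feature; the function name and the case-insensitive matching are assumptions:

```python
def resolve_query(query, page_titles):
    """Return the results page for a query, or None if nothing matches."""
    # Map lower-cased titles back to their canonical spelling
    # (case-insensitive matching is an assumption of this sketch).
    known = {title.lower(): title for title in page_titles}
    words = query.lower().split()
    # Try the full query first, then drop trailing words one at a time.
    for end in range(len(words), 0, -1):
        candidate = " ".join(words[:end])
        if candidate in known:
            return known[candidate]
    return None
```

With the pages "Albert Einstein" and "Shakira" in the wiki, the query "Albert Einstein birthplace" would resolve to "Albert Einstein". Pages that differ only in case, such as latex and LaTeX, would need a disambiguation page instead.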
You can check a pretty alpha version here http://www.todogratix.es (only
in Spanish for now, sorry), which I'm feeding with some bots.
I think that it is an interesting experiment. I'm open to your questions
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
| WikiEvidens <http://code.google.com/p/wikievidens/> |
| WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/
I'm wondering if anyone has done any research into identifying which
articles in Wikipedia have associated video?
There is this category, which only has 280 or so articles:
It seems far from complete. Appreciate any advice or previous work in this
The background: I'm working with some grad students on staging a Wiki Makes
Video contest in April, and we'd like to do some measurement of the current
state of video in Wikipedia.
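One possible starting point for such a measurement, sketched under assumptions: ask the MediaWiki API for the files used on each article (action=query with prop=images) and keep those whose names end in a video extension. The function names and the extension list are illustrative; .ogg files are left out because they may be audio rather than video.

```python
VIDEO_EXTENSIONS = (".ogv", ".webm")


def images_query_params(title):
    """Parameters for a MediaWiki API request listing the files used on a page."""
    return {
        "action": "query",
        "prop": "images",
        "titles": title,
        "imlimit": "max",
        "format": "json",
    }


def video_files(api_response):
    """Map each page title in a prop=images response to its video files."""
    found = {}
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        videos = [
            f["title"]
            for f in page.get("images", [])
            if f["title"].lower().endswith(VIDEO_EXTENSIONS)
        ]
        if videos:
            found[page["title"]] = videos
    return found
```

The parameters would be sent to a wiki's api.php endpoint, and the decoded JSON response passed to video_files(). Long file lists are paginated by the API, which this sketch does not handle.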
Thanks, and email me if you'd like to know more about the video project for
"This gem will perform a whitepaper lookup on major scholarly databases.
Its purpose is to easily find related papers and organize your paper
collection. With this application, you can easily download pdfs or use
it as a library to automatically assign metadata.
"Currently, CiteSeerX, ACM and IEEE are the only databases it uses along
with a google pdf/ps search to find other pdf or ps links to download."
The author says it is just for personal use.
Engineering Community Manager
I am working on creating a single entry page describing all the data about
Wikipedia and WMF projects available for researchers. The idea is to have a
single location that introduces all possible sources of data and makes it
easy for a newbie to understand what suits his/her needs and how to get and
work with the data. This is meant to be useful to the users (that is,
you), so I have a few questions to help me make it better:
1. I was wondering if any of you has used data from sources other than
those listed below and, if yes, which?
• XML dumps
• the API
• the Toolserver (or its future replacement on WMF Labs)
• our live IRC feeds
• our raw hourly pageview data dumps (and the rudimentary API that you
can use to query them at stats.grok.se)
• the sources listed on our (experimental) open data registry on the
DataHub http://datahub.io/group/wikimedia (includes DBpedia)
2. Is there any specific information that you wished you had known when
you started using WMF data but is not documented online?
3. Do you have any datasets or tools for
parsing/manipulating/visualizing data, which you think can be reused and
you want to share? (Could be something you built or something you found and
4. What information should be included about each source? I am thinking
1. description of the data - content, format, method of collection
or how you can collect it, how often it is collected, for what period
2. skills required to get and work with the data (PHP, SQL, etc.)
3. short sample
4. existing tools - for parsing, importing, etc.
5. maybe examples of projects where it was used?
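To illustrate what a "short sample" could look like for one source, here is a minimal sketch for the raw hourly pageview dumps. It assumes each line holds four space-separated fields (project code, page title, view count, bytes transferred); the names PageCount and parse_pagecount_line are made up for this example.

```python
from collections import namedtuple

# Field names are illustrative, not an official schema.
PageCount = namedtuple("PageCount", "project title views bytes_sent")


def parse_pagecount_line(line):
    """Parse one line of an hourly dump; return None if it is malformed."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, views, bytes_sent = parts
    try:
        return PageCount(project, title, int(views), int(bytes_sent))
    except ValueError:
        # One of the numeric fields was not an integer.
        return None
```

A line such as "en Main_Page 242332 4737756101" would then yield a record whose views field is the integer 242332.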
Any other comments/suggestions will be appreciated.
Thank you in advance.
can you recommend any FLOSS tool for visualizing the following
for a certain sample from namespaces 0 and 1
* language linking revisions for a sample of topically related pages
("how has interwiki linking been revised over time in this and that article in
this and that version?")
* individual user activity in more than one WP version, in a sample group of
about 30 languages
("which other version has user x contributed to on this topic?")
* temporal aspects of possible relatedness in revision frequency
("when was revision activity highest in certain language versions of this
thanks & cheers,
Wikipedia admins are editors entrusted with special privileges and
duties, responsible for the community management of Wikipedia. They
are elected using a special procedure defined by the Wikipedia
community, called Request for Adminship (RfA). Because of the growing
amount of management work (quality control, coordination, maintenance)
on the Wikipedia, the importance of admins is growing. At the same
time, there exists evidence that the admin community is growing more
slowly than expected. We present an analysis of the RfA procedure in
the Polish-language Wikipedia, since the procedure’s introduction in
2005. With the goal of discovering good candidates for new admins that
could be accepted by the community, we model the admin elections using
multidimensional behavioral social networks derived from the Wikipedia
edit history. We find that we can classify the votes in the RfA
procedures using this model with an accuracy level that should be
sufficient to recommend candidates. We also propose and verify
interpretations of the dimensions of the social network. We find that
one of the dimensions, based on discussion on Wikipedia talk pages,
can be validly interpreted as acquaintance among editors, and discuss
the relevance of this dimension to the admin elections.
From the conclusion:
"[...] We have noticed the decreasing amount of successful admin
elections and have formulated two hypotheses that could explain this
phenomenon. Hypothesis A stated that new admins are elected on the
basis of acquaintance of the voter and candidate. If this would be a
valid explanation, we could conclude that the community of admins is
becoming increasingly closed, which would be detrimental to the
sustainable development of the Wikipedia.
Hypothesis B stated that new admins are elected on the basis of
similarity of experience in editing various topics of the voter and
candidate. Since voters are other active admins whose experience
increases with time, their thresholds of accepting a candidate are
likely to increase (as has been observed from the simple statistics of
I would love to see this research on other Wikipedias.
Everton Zanella Alvarenga (also Tom)
"A life spent making mistakes is not only more honorable, but more
useful than a life spent doing nothing."
We gladly announce the source code release of our named entity disambiguation system AIDA: Accurate Online Disambiguation of Named Entities.
Given a natural-language text, AIDA maps mentions of ambiguous names onto canonical entities (people, places, etc.) registered in a knowledge base such as Yago (http://yago-knowledge.org), which is linked to Wikipedia.
For example, in the sentence "When Page played Kashmir at Knebworth, his Les Paul was uniquely tuned", AIDA would identify "Page" with Jimmy_Page, "Kashmir" with Kashmir_Song, "Knebworth" with the festival, and "Les Paul" with the Gibson_Les_Paul guitar.
The source code of AIDA is available on GitHub under the CC-BY-NC-SA license: https://github.com/yago-naga/aida/
and at the AIDA project page: http://www.mpi-inf.mpg.de/yago-naga/aida/
The AIDA team
at the Max Planck Institute for Informatics