This paper (first reference) is the result of a class project I was part of
almost two years ago for CSCI 5417 Information Retrieval Systems. It builds
on a class project I did in CSCI 5832 Natural Language Processing, which
I presented at Wikimania '07. The project was very late, as we didn't send
in the final paper until the day before New Year's. As far as I recall, this
technical report was never really announced, so I thought it would be
interesting to look briefly at the results. The goal of this paper was to
break articles down into surface features and latent features and then use
those to study the rating system being used, predict article quality and
rank results in a search engine. We used the [[random forests]] classifier,
which allowed us to analyze the contribution of each feature to performance
by looking directly at the weights that were assigned. While the surface
analysis was performed on the whole English Wikipedia, the latent analysis
was performed on the
Simple English Wikipedia (it is more expensive to compute).

= Surface features =
* Readability measures are the single best predictor I have found of quality as defined by the Wikipedia Editorial Team (WET). The [[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed by length of article HTML, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number of internal links, [[Laesbarhedsindex Readability Formula]], number of words and number of references.
* Weakly predictive: number of to be's, number of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number of external links, number of relative links.
* Not predictive (overall - see the end of section 2 for the per-rating score breakdown): number of h2 or h3's, number of conjunctions, number of images*, average word length, number of h4's, number of prepositions, number of pronouns, number of interlanguage links, average syllables per word, number of nominalizations, article age (based on page id), proportion of questions, average sentence length.
:* Number of images was actually by far the single strongest predictor of any class, but only for Featured articles. Because it was so good at picking out Featured articles and only somewhat good at picking out A and G articles, the classifier was confused in so many cases that the overall contribution of this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high predictive value assigned to many features. The Featured class is also very distinctive. F, B and S (Start/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.

= Latent features =
The algorithm used for latent analysis, which is an
analysis of the occurrence of words in every document with respect to the
link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
Allocation]]. This part of the analysis was done by CS PhD student Praful
Mangalath. As an example of what can be done with the result of this
analysis, you provide a word (a search query) such as "hippie". You can then
look at the weight of every article for the word hippie. You can pick the
article with the largest weight, and then look at its link network. You can
pick out the articles that this article links to and/or which link to this
article that are also weighted strongly for the word hippie, while also
contributing maximally to this article's "hippieness". We tried this query in
our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple
English Wikipedia's Lucene search engine. The breakdown of articles occurring
in the top ten search results for this word for those engines is:
* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]], [[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]], [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]
(See the paper for the articles produced for the keywords philosophy and economics.)

= Discussion / Conclusion =
* The results of the latent analysis are totally up to your
perception. But what is interesting is that the LDA features predict the WET
ratings of quality just as well as the surface level features. Both feature
sets (surface and latent) pull out almost all of the information that
the rating system bears.
* The rating system devised by the WET is not distinctive. You can best tell the difference between, grouped together, Featured, A and Good articles vs B articles. Featured, A and Good articles are also quite distinctive (Figure 1). Note that in this study we didn't look at Starts and Stubs, but in an earlier paper we did.
:* This is interesting when compared to this recent entry on the YouTube blog, "Five Stars Dominate Ratings".
I think a sane, well researched (with actual subjects) rating system is
well within the purview of the Usability Initiative. Helping people find and
create good content is what Wikipedia is all about. Having a solid rating
system allows you to reorganize the user interface, the Wikipedia
namespace, and the main namespace around good content and bad content as
needed. If you don't have a solid, information bearing rating system you
don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles
and using that to train machines to automatically pick out good content. You
ask people questions along dimensions that make sense to people, and give
the machine access to other surface features (such as a statistical measure
of readability, or length) and latent features (such as can be derived from
document word occurrence and encyclopedia link structure). I referenced page
262 of Zen and the Art of Motorcycle Maintenance to give an example of the
kind of qualitative features I would ask people about. It really depends on
which features end up bearing information, to be tested in "the lab". Each
word is an example dimension of quality: We have "*unity, vividness,
authority, economy, sensitivity, clarity, emphasis, flow, suspense,
brilliance, precision, proportion, depth and so on.*" You then use surface
and latent features to predict these values for all articles. You can also
say that when a person rates an article as high on the x scale, they also
mean that it has this much of these surface and these latent features.
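
To make the idea of a surface feature concrete, here is a minimal Python sketch of the Automated Readability Index, one of the readability measures listed under surface features. It uses the standard published ARI formula. The function name and the simple tokenization (whitespace-separated words, sentences split on ".", "!" and "?") are my own assumptions for illustration; this is not the feature extraction code used for the paper.

```python
import re


def automated_readability_index(text: str) -> float:
    """Rough ARI score for a piece of plain text.

    Assumption: words are whitespace-separated tokens, sentences end at
    '.', '!' or '?', and only letters and digits count as characters.
    """
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # ARI counts letters and digits, not punctuation or whitespace.
    characters = sum(ch.isalnum() for ch in text)
    return (
        4.71 * (characters / len(words))
        + 0.5 * (len(words) / len(sentences))
        - 21.43
    )


if __name__ == "__main__":
    print(automated_readability_index("The cat sat. The dog ran."))
```

With this tokenization, the short text "The cat sat. The dog ran." scores about -5.8, and text with longer words and longer sentences scores higher. A real pipeline would need to strip wiki markup first and would probably want a more careful sentence splitter.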
= References =
- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
Wikipedia through quality and concept discovery*. Technical Report.
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
feasibility of automatically rating online article quality*. Technical
I have asked and received permission to forward to you all this most
excellent bit of news.
The LINGUIST List is an excellent resource for people interested in the
field of linguistics. As I mentioned some time ago, they held a funding
drive in which they asked for a certain amount of money within a given
number of days; if that goal was met, they would start a project on
Wikipedia to learn what needs doing to get better coverage for the field of
linguistics. What you will read in this mail is that the whole community of
linguists is asked to cooperate. I am really thrilled, as it will also get
more linguists interested in what we do. My hope is that a fraction of them
will be interested in the languages that they care for and will help them
become more relevant. As a member of the "language prevention committee", I
would love to get more knowledgeable people involved in our smaller
projects. If it means that we get more requests for more projects, we will
really feel embarrassed by all the new projects we will have to approve
because of the quality of the Incubator content and the quality of the
linguistic arguments for why we should approve yet another language :)
NB: Is this not a really clever way of raising money? Give us this much in
this time frame and we will then do this as a bonus...
---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
LINGUIST List: Vol-18-1831. Mon Jun 18 2007. ISSN: 1068 - 4875.
Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
As you may recall, one of our Fund Drive 2007 campaigns was called the
"Wikipedia Update Vote." We asked our viewers to consider earmarking their
donations to organize an update project on linguistics entries in the
English-language Wikipedia. You can find more background information on this
The speed with which we met our goal, thanks to the interest and generosity
of our readers, was a sure sign that the linguistics community was enthusiastic
about the idea. Now that summer is upon us, and some of you may have a bit
of leisure time, we are hoping that you will be able to help us get started on
the Wikipedia project. The LINGUIST List's role in this project is a purely
organizational one. We will:
*Help, with your input, to identify major gaps in the Wikipedia materials or
pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified as
"in need of attention from an expert on the subject" or "does not cite any
references or sources," etc;
*Send out periodical calls for volunteer contributors on specific topics or
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with Wikimedia Foundation to publicize the linguistics community's
We hope you are as enthusiastic about this effort as we are. Just to help us
get started looking at Wikipedia more critically, and to easily identify an
needing improvement, we suggest that you take a look at the List of
Many people are not listed there; others need to have more facts and
added. If you would like to participate in this exciting update effort,
please respond by sending an email to LINGUIST Editor Hannah Morales at
hannah(a)linguistlist.org, suggesting what your role might be or which
entries you feel should be updated or added. Some linguists who saw our
on the Internet have already written us with specific suggestions, which we
will share with you soon.
This update project will take major time and effort on all our parts. The
result will be a much richer internet resource of information on the breadth
and depth of the field of linguistics. Our efforts should also stimulate
students to consider studying linguistics and educate a wider public on
what we do. Please consider participating.
Editor, Wikipedia Update Project
Linguistic Field(s): Not Applicable
There is a request for a Wikipedia in Ancient Greek. This request has so far
been denied. A lot of words have been used about it. Many people maintain
their positions and do not for whatever reason consider the arguments of
In my opinion there are a few roadblocks.
- Ancient Greek is an ancient language - the policy does not allow for
- Text in Ancient Greek written today about contemporary subjects
requires the reconstruction of Ancient Greek.
- it requires the use of existing words for concepts that did
not exist at the time when the language was alive
- neologisms will be needed to describe things that did not
exist at the time when the language was alive
- modern texts will not represent the language as it used to be
- Constructed and by inference reconstructed languages are effectively
We can change the policy if there are sufficient arguments, when we agree on
When a text is written in reconstructed Ancient Greek, and when it is
clearly stated that it is NOT the Ancient Greek of bygone days, it will be
obvious that it is a great tool for learning to read and write Ancient
Greek but that it is in itself not Ancient Greek. Ancient Greek as a
language is ancient. I have had a word with people who are involved in the
working group that deals with the ISO-639, and I have had a word with
someone from SIL. It is clear that a proposal for a code for "Ancient Greek
reconstructed" will be considered for the ISO-639-3. For the ISO-639-6 a
code is likely to be given because a clear use for this code can be given.
We can apply for a code, and as it has a use bigger than Wikipedia alone it
clearly has merit.
With modern texts clearly labelled as distinct from the original language,
it will be obvious that innovations a writer needs for his writing are
This leaves the fact that constructed and reconstructed languages are not
permitted because of the notion that mother tongue users are required. In my
opinion, this has always been only a gesture to those people who are dead
set against any and all constructed languages. The policies contain
something vague: "*it must have a reasonable degree of recognition as
determined by discussion (this requirement is being discussed by the language
subcommittee <http://meta.wikimedia.org/wiki/Language_subcommittee>).*" It
is vague because even though the policy talks about a discussion, it is
killed off immediately by stating "The proposal has a sufficient number of
living native speakers to form a viable community and audience." In my
opinion, this discussion for criteria for the acceptance of constructed or
reconstructed languages has not happened. Proposals for objective criteria
have been ignored.
In essence, to be clear about it:
- We can get a code for reconstructed languages.
- We need to change the policy to allow for reconstructed and
We need to do both in order to move forward.
The proposal for objective criteria for constructed and reconstructed
languages is in a nutshell:
- The language must have an ISO-639-3 code
- We need full WMF localisation from the start
- The language must be sufficiently expressive for writing a modern
- The Incubator project must have sufficiently large articles that
demonstrate both the language and its ability to write about a wide range of
- A sufficiently large group of editors must be part of the Incubator
Let's see what we've got here:
A "Board" that appears answerable only to some god; an "Executive Director"
who answers only to this "Board"; a group of "Moderators" who claim (with a
straight face) that they are "independent", but whose "moderations" are
clearly designed to keep the first two in a favorable light; and, dead last,
you have the people who, not so ironically, create the substance of the
thing that makes the first three possible. This setup sounds achingly
familiar. And, like all similar setups throughout history, is set up to
on 10/20/10 12:44 AM, Virgilio A. P. Machado at vam(a)fct.unl.pt wrote:
> I agree with you. You raised some very good points.
> Virgilio A. P. Machado
> At 03:47 20-10-2010, you wrote:
>> ________________________________
>> From: Austin Hair <adhair(a)gmail.com>
>> To: Wikimedia Foundation Mailing List <foundation-l(a)lists.wikimedia.org>
>> Sent: Tue, October 19, 2010 12:35:07 PM
>> Subject: Re: [Foundation-l] Greg Kohs and Peter Damian
>>
>> On Mon, Oct 18, 2010 at 6:40 PM, Nathan <nawrich(a)gmail.com> wrote:
>> > If it pleases the moderators, might we know on what basis Greg was
>> > banned and Peter indefinitely muzzled?
>>
>> Greg Kohs was banned for the same reason that he's been on moderation
>> for the better part of the past year, namely, that he was completely
>> unable to keep his contributions civil, and caused more flamewars than
>> constructive discussion.
>>
>> Peter Damian is only on moderation, and we'll follow our usual policy
>> of letting through anything that could be considered even marginally
>> acceptable. We really are very liberal about this; otherwise you
>> wouldn't have heard from Mr. Kohs at all in the past six months.
>>
>> I'm sure that my saying this won't convince anyone who's currently
>> defending him, but nothing about the decision to ban Greg Kohs was
>> retaliatory. I'll also (not for the first time) remind everyone that
>> neither the Wikimedia Foundation Board, nor its staff, nor any chapter
>> or other organizational body has any say in the administration of this
>> list.
>>
>> I hope that clears up all of the questions asked in this thread so far.
>>
>> It is not about defending anyone but about the fact that the "I know
>> bannable when I see it" theory of moderation is unconstructive and
>> leads to dramafests. The next ban is the one that will likely cause a
>> real flame war. I suspect *more* people would be on moderation if any
>> sort of objective criteria were being used. The lack of explanation
>> over this bothers me so much because I suspect that you *can't*
>> explain it. It seems to be the sort of gut-shot that hasn't been
>> thought through. Moderate more people based on real criteria, rather
>> than how you feel about them.
>>
>> Birgitte
> foundation-l mailing list
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
From what I have seen, Greg Kohs does have some
interesting points to make, but I do see that he is jumping to
conclusions and does seem to have a biased viewpoint.
People want to make their own decisions and have enough information to
do that. We don't want to have important information deleted away
because it is uncomfortable.
Banning him makes it less likely for him to be heard, and these
interesting points, which are worth considering, are not heard by many
people. This deprives people of critical information, which is not
fair to the people involved.
Just look at this article for example, it is quite interesting and
well written, and why should it not be visible to everyone on the
Deleting and banning people who say things that are not comfortable
does not make you look balanced and trustworthy.
The Wikimedia foundation should be able to stand up to such
accusations without resorting to gagging people, it just gives more
credit to the people being gagged and makes people wonder if there is
any merit in what they say.
This brings up my favorite subject of unneeded deletions versus needed ones.
Of course there is material that should be deleted because it is hateful,
spam etc.; let's call that evil content.
But the articles that I wrote and my friends wrote that were deleted
did not fall into that category, they might have been just bad or not
We have had a constant struggle to keep our articles from being
deleted in a manner that we consider unfair. Additionally, the bad
content is lost and falls into the same category as evil content.
Also there should be more transparency on deleted material on the
Wikipedia itself, there is a lot of information that is being deleted
and gone forever without proper process or review.
In my eyes there is a connection between the two topics, the banning
of people and the deleting of information. Both unfairly deprive people
of information that they want and need.
Instead of articles about obscure events, things, and old places in
Kosovo you have a Wikipedia full of the latest information about every
television show. Is that what you really want?
I think there should be room for things in places that are not
notable because they are not part of mainstream pop culture. We also
need to support the underdogs of Wikipedia even if they are not
mainstream. Mr Kohs definitely has something to say and I would
like to hear it. And the Kosovars have something to say even if the
Serbs don't want to hear it. The Albanians have something to say even
if the Greeks don't want to hear it, etc. There are many cases of
people from Kosovo and Albania being driven out of Wikipedia, depriving
the project of important information, because they are not able to get
started and their contributions are so far away from the dominating
political viewpoint of the opposite side that they don't even get a
chance to be heard.
We need to make a way for these people to be heard and to moderate the
conflicts better; that will make Wikipedia stronger and more robust.
Back in September we had an open community IRC meeting, where we
introduced the new Trustees and talked about various issues. It was
pretty successful and we discussed afterwards making such "community
meetings" a regular event.
I'd like to revive this idea :) I've made a proposal for having
community meetings on the first Saturday of the month:
Which would make the first upcoming meeting on February 5.
I proposed 17:00UTC as a time, but please discuss good days/times on
the talk page if you are interested in attending; we'll need to rotate
I envision this as not really a Q&A session like the staff office
hours, but rather as a chance for community members to get together
and talk about important issues in a structured way. To that end,
please add your proposed agenda items to the wiki. It would also be
great to have some volunteers to take notes/moderate.
Of course this is just an experiment -- but there seemed to be a lot
of interest in having such meetings, so I'd like to try it out. Let me
know what you think and if you'd be interested.
* I use this address for lists; send personal messages to phoebe.ayers
<at> gmail.com *
In his 10th anniversary address Jimmy Wales says: "Today is a great
moment to reflect on where we've been."
What my reflection brings up is that the single thing that probably
raised more controversy among the widest range of Wikimedians is not
the content of articles about sex, celebrities or geopolitical and
linguistic conflicts, but the procedures of appointing administrators.
It should have never been a big deal, but it is, in all projects in
The "administrator" privilege lumps together several very different permissions:
* blocking and unblocking
* deleting and restoring pages and versions of pages
* viewing deleted versions of pages
* protecting and unprotecting pages and editing protected pages
* some PendingChanges/FlaggedRevisions-related permissions, which i
haven't quite figured out yet :)
Now i, in general, think that these permissions should be given
liberally to as many reasonable Wikimedians as possible. I always
believed in it, and since most of these actions became visible in the
watchlist a few years ago, this belief became even stronger.
But some re-thinking is needed. The administrator privilege, as it is
now, should be retired and broken up into several separate privileges:
* protect, unprotect, edit protected, config PendingChanges on the page
* edit highly technical pages - the MediaWiki: namespace, common.css, etc.
* revert, delete/undelete, view deleted
The permission to revert, delete and undelete unprotected pages can be
given to those users who can create and move pages ("autoconfirmed").
There is no big functional difference between deleting a page and
deleting a paragraph in an existing page or doing a major re-write.
The difference between reverting and undoing is a matter of civility
and a lot of uncivil things can be done without permissions anyway.
Limiting these actions only to certain users is quite pointless.
Viewing deleted pages shouldn't be a big deal either. Deletion is not
so much eliminating non-notable topics and nonsense from existence, as
about separating them from encyclopedic articles. It shouldn't be a
big deal to let bored people read them somewhere. Eliminating
egregiously offensive and illegal content, major copyright violations
and BLP issues can be accomplished today with the oversight
Controlling Pending Changes, although i haven't figured out all of its
intricacies, is essentially an improved version of page protection. It
makes sense to give this permission to (many) selected people. It will
probably evolve over time, and i believe that it will evolve more
organically if conceptually separated from blocking and deletion.
Another comment about protection is that protecting system messages
(the MediaWiki: namespace) and sensitive CSS and JS pages (common.css
etc.) is very different from protecting vandalism-prone articles
(Obama etc.). The protection of these technical pages and the protection
of sensitive articles should be different concepts.
The permission to block should be a separate one. Separating the
discussions about giving users the permission to protect pages and to
block vandals will not stop the holy wars, but it will focus them.
There will be no more comments such as:
* "User:PhDhistorian may be a good editor who understands
Verifiability and who can be trusted to edit sensitive BLP articles,
but he has personal grudges with User:FatMadonna and he may block her,
so he shouldn't be given the Administrator privilege."
* "User:VandalFighterGrrrl is excellent at patrolling RC, but she's
too inclusionist and shouldn't be given the right to decide about
All of the above is formulated in the English Wikipedia terms. I
believe that the English Wikipedia policies for deletion, protection
and blocking make a lot of sense and should be adopted by all
Wikipedias, but this obviously can't be forced on any Wikipedia. Other
projects may have a very different understanding of these processes and
that's OK. I'm only talking about the technical separation of the
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
"We're living in pieces,
I want to live in peace." - T. Moore
Tonight, Egypt has ordered all operators to shut down their BGP
adjacencies with out-of-country providers. This means Egypt is
disconnected from the rest of the internet.
I am wondering, should we just close our site in support? That would
surely have a huge impact and show how much we care about free
information for everyone.
Apart from celebrating 10 years of Wikipedia, 2011 is also an election year for the Board of Trustees of the Wikimedia Foundation.
As you may recall, the board has three directly elected representatives on it who serve for two years. Currently those are Mindspillage, SJ and Wing. As in past years, we rely on an effective election committee to coordinate the elections for us. They not only guarantee that the election is overseen by an independent body, but they also make sure that the tremendous amount of work that needs to be done is taken care of. My job is to coordinate the formation of this committee.
This is a call for volunteers to serve on the election committee. If you feel that you can contribute to this committee, please contact me and give a small summary of why you think you would be able to help out with this process. Just to make sure we all understand: you cannot be part of the election committee if you are planning to be a candidate or are planning to support any candidate publicly. The deadline for any extra volunteers is January 22nd, 12:00 UTC.
The timeline for the next steps in the process will be published sometime in February by the election committee. So if you are interested in becoming a candidate, it is time to start preparing!
Jan-Bart de Vreede
Wikimedia Board of Trustees
Board Liaison, Election Committee
PS: Should you want to know more about the role of a board member: http://meta.wikimedia.org/wiki/Wikimedia_board_manual