This paper (first reference) is the result of a class project I was part of
almost two years ago for CSCI 5417 Information Retrieval Systems. It builds
on a class project I did in CSCI 5832 Natural Language Processing, which
I presented at Wikimania '07. The project ran very late; we didn't send
the final paper in until the day before New Year's. This technical report
was never really announced that I recall, so I thought it would be
interesting to look briefly at the results. The goal of this paper was to
break articles down into surface features and latent features and then use
those to study the rating system being used, predict article quality, and
rank results in a search engine. We used the [[random forests]] classifier,
which allowed us to analyze the contribution of each feature to performance
by looking directly at the weights that were assigned. While the surface
analysis was performed on the whole English Wikipedia, the latent analysis
was performed on the Simple English Wikipedia (it is more expensive to
compute).

= Surface features =

* Readability measures are the single best predictor of quality
that I have found, as defined by the Wikipedia Editorial Team (WET). The
[[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid
Grade Level]] were the strongest predictors, followed by length of article
HTML, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number
of internal links, [[Laesbarhedsindex Readability Formula]], number of words
and number of references. Weakly predictive were number of "to be"s, number
of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number
of external links, and number of relative links. Not predictive (overall - see
the end of section 2 for the per-rating score breakdown): number of h2s or
h3s, number of conjunctions, number of images*, average word length, number
of h4s, number of prepositions, number of pronouns, number of interlanguage
links, average syllables per word, number of nominalizations, article age
(based on page id), proportion of questions, and average sentence length.
:* Number of images was actually by far the single strongest predictor of any
class, but only for Featured articles. Because it was so good at picking out
Featured articles and somewhat good at picking out A and G articles, the
classifier was confused in so many cases that the overall contribution of
this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high
predictive value assigned to many features. The Featured class is also very
distinctive. F, B and S (Stop/Stub) contain the most information.
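For concreteness, the two readability indices that top the list above can be computed from simple counts. This is a rough sketch using the standard published formulas, not the paper's actual implementation; the tokenization here is deliberately naive:

```python
import re

def automated_readability_index(text):
    """[[Automated Readability Index]] from character, word and sentence counts."""
    words = text.split()
    chars = sum(len(w.strip(".,;:!?\"'()")) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """[[Flesch-Kincaid Grade Level]] from pre-computed counts.
    (Syllable counting needs a dictionary or heuristic, elided here.)"""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
```

A full surface feature vector would add the other measures listed above (word counts, link counts, template counts, and so on) computed the same way.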
:* A is the least distinct class, not being very different from F or G.

= Latent features =

The algorithm used for latent analysis, which is an analysis of the
occurrence of words in every document with respect to the link structure of
the encyclopedia ("concepts"), is [[Latent Dirichlet Allocation]]. This part
of the analysis was done by CS PhD student Praful Mangalath. An example of
what can be done with the result of this analysis
is that you provide a word (a search query) such as "hippie". You can then
look at the weight of every article for the word hippie. You can pick the
article with the largest weight, and then look at its link network. You can
pick out the articles that this article links to and/or which link to this
article that are also weighted strongly for the word hippie, while also
contributing maximally to this article's "hippieness". We tried this query in
our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple
English Wikipedia's Lucene search engine. The breakdown of articles occurring
in the top ten search results for this word for those engines is:

* LDA
only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl
Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic
Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]],
[[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]],
[[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]],
[[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a
democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]

(See the paper for the articles produced for the keywords philosophy and
economics.)

= Discussion / Conclusion =

* The results of the latent analysis are totally up to your
perception. But what is interesting is that the LDA features predict the WET
ratings of quality just as well as the surface-level features. Both feature
sets (surface and latent) pull out almost all of the information that
the rating system bears.
* The rating system devised by the WET is not distinctive. You can best tell
the difference between, grouped together, Featured, A and Good articles vs B
articles. Featured, A and Good articles are also quite distinctive (Figure
1). Note that in this study we didn't look at Starts and Stubs, but in an
earlier paper we did.
:* This is
interesting when compared to this recent entry on the YouTube blog. "Five
Stars Dominate Ratings"
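As a sketch of the feature-weight readout described in the introduction: train a random forest and inspect its per-feature importances. The feature names and data below are synthetic stand-ins, not the paper's actual feature matrix, and scikit-learn is assumed as the implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["flesch_kincaid", "html_length", "n_images"]  # stand-ins

# Synthetic articles: the quality label is driven mostly by the first
# feature, so its importance should dominate the readout.
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, w in sorted(zip(feature_names, clf.feature_importances_),
                      key=lambda t: -t[1]):
    print(f"{name}\t{w:.3f}")
```

The importances sum to one, so each weight reads directly as a fractional contribution to the classifier's decisions, which is the kind of per-feature breakdown reported above.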
I think a sane, well-researched (with actual subjects) rating system is
well within the purview of the Usability Initiative. Helping people find and
create good content is what Wikipedia is all about. Having a solid rating
system allows you to reorganize the user interface, the Wikipedia
namespace, and the main namespace around good content and bad content as
needed. If you don't have a solid, information-bearing rating system you
don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles
and using that to train machines to automatically pick out good content. You
ask people questions along dimensions that make sense to people, and give
the machine access to other surface features (such as a statistical measure
of readability, or length) and latent features (such as can be derived from
document word occurrence and encyclopedia link structure). I referenced page
262 of Zen and the Art of Motorcycle Maintenance to give an example of the
kind of qualitative features I would ask people about. It really depends on
what features end up bearing information, to be tested in "the lab". Each
word is an example dimension of quality: we have "*unity, vividness,
authority, economy, sensitivity, clarity, emphasis, flow, suspense,
brilliance, precision, proportion, depth and so on.*" You then use surface
and latent features to predict these values for all articles. You can also
say that when a person rates an article as high on the x scale, they also
mean that it has this much of these surface and these latent features.
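The latent half of that pipeline can be sketched in a few lines. This toy version uses scikit-learn's LDA on a three-document stand-in corpus (the real analysis ran over the Simple English Wikipedia and also used link structure, which is omitted here): each article is scored for a query word through the topics that generate that word.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus; titles and texts are illustrative only.
docs = {
    "Hippie": "hippie counterculture peace love psychedelic festival",
    "Woodstock festival": "festival music hippie rock concert stage",
    "Nuclear weapons": "nuclear weapons war bomb treaty test",
}
vec = CountVectorizer()
X = vec.fit_transform(docs.values())
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)  # per-article topic weights, rows sum to 1

# P(word | article) via the topic mixture: rank articles for a query word.
word = vec.vocabulary_["hippie"]
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
scores = doc_topics @ topic_word[:, word]
for title, s in sorted(zip(docs, scores), key=lambda t: -t[1]):
    print(f"{title}\t{s:.3f}")
```

Ranking search results by these per-article word weights is the query experiment described in the Latent features section, minus the link-network filtering step.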
= References =
- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
Wikipedia through quality and concept discovery*. Technical Report.
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
feasibility of automatically rating online article quality*. Technical
Report.
I have asked and received permission to forward to you all this most
excellent bit of news.

The LINGUIST List is a most excellent resource for people interested in the
field of linguistics. As I mentioned some time ago, they have had a funding
drive, and in that funding drive they asked for a certain amount of money in
a given amount of days; they would then run a project on Wikipedia to
learn what needs doing to get better coverage for the field of linguistics.
What you will read in this mail is that the whole community of linguists is
asked to cooperate. I am really thrilled, as it will also get us more
linguists interested in what we do. My hope is that a fraction will be
interested in the languages that they care for and help them become more
relevant. As a member of the "language prevention committee", I love to get
more knowledgeable people involved in our smaller projects. If it means that
we get more requests for more projects, we will really feel embarrassed with
all the new projects we will have to approve because of the quality of the
Incubator content and the quality of the linguistic arguments why we should
approve yet another language :)

NB: Is this not a really clever way of raising money? Give us this much in
this time frame and we will then do this as a bonus...
---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
LINGUIST List: Vol-18-1831. Mon Jun 18 2007. ISSN: 1068-4875.
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
To post to LINGUIST, use our convenient web form at
-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
As you may recall, one of our Fund Drive 2007 campaigns was called the
"Wikipedia Update Vote." We asked our viewers to consider earmarking their
donations to organize an update project on linguistics entries in the
English-language Wikipedia. You can find more background information on this
campaign. The speed with which we met our goal, thanks to the interest and
generosity of our readers, was a sure sign that the linguistics community
was enthusiastic about the idea. Now that summer is upon us, and some of you
may have a bit of leisure time, we are hoping that you will be able to help
us get started on the Wikipedia project. The LINGUIST List's role in this
project is a purely organizational one. We will:
*Help, with your input, to identify major gaps in the Wikipedia materials or
pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified
as "in need of attention from an expert on the subject" or "does not cite
any references or sources," etc.;
*Send out periodic calls for volunteer contributors on specific topics or
pages;
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with the Wikimedia Foundation to publicize the linguistics
community's contributions.
We hope you are as enthusiastic about this effort as we are. Just to help us
get started looking at Wikipedia more critically, and to easily identify
entries needing improvement, we suggest that you take a look at the List of
linguists. Many people are not listed there; others need to have more facts
and references added. If you would like to participate in this exciting
update effort, respond by sending an email to LINGUIST Editor Hannah Morales
at hannah(a)linguistlist.org, suggesting what your role might be or which
entries you feel should be updated or added. Some linguists who saw our
announcement on the Internet have already written us with specific
suggestions, which we will share with you soon.

This update project will take major time and effort on all our parts. The
result will be a much richer internet resource of information on the breadth
and depth of the field of linguistics. Our efforts should also stimulate
students to consider studying linguistics and to educate a wider public on
what we do. Please consider participating.

Hannah Morales
Editor, Wikipedia Update Project

Linguistic Field(s): Not Applicable

LINGUIST List: Vol-18-1831
There is a request for a Wikipedia in Ancient Greek. This request has so far
been denied. A lot of words have been used about it. Many people maintain
their positions and do not, for whatever reason, consider the arguments of
others. In my opinion there are a few roadblocks.

- Ancient Greek is an ancient language - the policy does not allow for
ancient languages.
- Text in ancient Greek written today about contemporary subjects
requires the reconstruction of Ancient Greek.
- it requires the use of existing words for concepts that did
not exist at the time when the language was alive
- neologisms will be needed to describe things that did not
exist at the time when the language was alive
- modern texts will not represent the language as it used to be
- Constructed and, by inference, reconstructed languages are effectively
not permitted.

We can change the policy if there are sufficient arguments, once we agree
on them.
When a text is written in reconstructed Ancient Greek, and when it is
clearly stated that it is NOT the Ancient Greek of bygone days, it can be
obvious that it is a great tool for learning the skills to read and write
Ancient Greek, but that it is in itself not Ancient Greek. Ancient Greek as
a language is ancient. I have had a word with people who are involved in the
working group that deals with the ISO-639, and I have had a word with
someone from SIL, and it is clear that a proposal for a code for "Ancient
Greek, reconstructed" will be considered for the ISO-639-3. For the
ISO-639-6 a code is likely to be given, because a clear use for this code
can be given. We can apply for a code, and as it has a use bigger than
Wikipedia alone it clearly has merit.

With modern texts clearly labelled as distinct from the original language,
it will be obvious that the innovations a writer needs for his writing are
not the language as it used to be.

This leaves the fact that constructed and reconstructed languages are not
permitted because of the notion that mother-tongue users are required. In my
opinion, this has always been only a gesture to those people who are dead
set against any and all constructed languages. In the policies there is
something vague: "*it must have a reasonable degree of recognition as
determined by discussion (this requirement is being discussed by the language
subcommittee <http://meta.wikimedia.org/wiki/Language_subcommittee>)."* It
is vague because even though the policy talks about a discussion, it is
killed off immediately by stating "The proposal has a sufficient number of
living native speakers to form a viable community and audience." In my
opinion, the discussion of criteria for the acceptance of constructed or
reconstructed languages has not happened. Proposals for objective criteria
have been ignored.
In essence, to be clear about it:

- We can get a code for reconstructed languages.
- We need to change the policy to allow for reconstructed and
constructed languages.

We need to do both in order to move forward.

The proposal for objective criteria for constructed and reconstructed
languages is in a nutshell:

- The language must have an ISO-639-3 code
- We need full WMF localisation from the start
- The language must be sufficiently expressive for writing a modern
encyclopaedia
- The Incubator project must have sufficiently large articles that
demonstrate both the language and its ability to write about a wide range
of subjects
- A sufficiently large group of editors must be part of the Incubator
project
Let's see what we've got here:
A "Board" that appears answerable only to some god; an "Executive Director"
who answers only to this "Board"; a group of "Moderators" who claim (with a
straight face) that they are "independent", but whose "moderations" are
clearly designed to keep the first two in a favorable light; and, dead last,
you have the people who, not so ironically, create the substance of the
thing that makes the first three possible. This setup sounds achingly
familiar. And, like all similar setups throughout history, is set up to
on 10/20/10 12:44 AM, Virgilio A. P. Machado at vam(a)fct.unl.pt wrote:
> I agree with you. You raised some very good points.
> Virgilio A. P. Machado
> At 03:47 20-10-2010, you wrote:
>> ________________________________ From: Austin
>> Hair <adhair(a)gmail.com> To: Wikimedia Foundation
>> Mailing List <foundation-l(a)lists.wikimedia.org>
>> Sent: Tue, October 19, 2010 12:35:07 PM Subject:
>> Re: [Foundation-l] Greg Kohs and Peter Damian On
>> Mon, Oct 18, 2010 at 6:40 PM, Nathan
>> <nawrich(a)gmail.com> wrote: > If it pleases the
>> moderators, might we know on what basis Greg
>> was > banned and Peter indefinitely muzzled?
>> Greg Kohs was banned for the same reason that
>> he's been on moderation for the better part of
>> the past year; namely, that he was completely
>> unable to keep his contributions civil, and
>> caused more flamewars than constructive
>> discussion. Peter Damian is only on moderation,
>> and we'll follow our usual policy of letting
>> through anything that could be considered even
>> marginally acceptable. We really are very
>> liberal about this; otherwise you wouldn't have
>> heard from Mr. Kohs at all in the past six
>> months. I'm sure that my saying this won't
>> convince anyone who's currently defending him,
>> but nothing about the decision to ban Greg Kohs
>> was retaliatory. I'll also (not for the first
>> time) remind everyone that neither the Wikimedia
>> Foundation Board, nor its staff, nor any chapter
>> or other organizational body has any say in the
>> administration of this list. I hope that clears
>> up all of the questions asked in this thread so
>> far. It is not about defending anyone but about
>> the fact that the "I know bannable when I see
>> it" theory of moderation is unconstructive and
>> leads to dramafests. The next ban is the one
>> that will likely cause a real flame war. I
>> suspect *more* people would be on moderation if
>> any sort of objective criteria were being
>> used. The lack of explanation over this bothers
>> me so much because I suspect that you *can't*
>> explain it. It seems to be the sort of gut-shot
>> that hasn't been thought through. Moderate more
>> people based on real criteria, rather than how
>> you feel about them. Birgitte
>> foundation-l mailing list
>> foundation-l(a)lists.wikimedia.org Unsubscribe:
From what I have seen, Greg Kohs does have some
interesting points to make, but I do see that he is jumping to
conclusions and does seem to have a biased viewpoint.
People want to make their own decisions and have enough information to
do that. We don't want to have important information deleted
because it is uncomfortable.
Banning him makes it less likely for him to be heard, and these
interesting points, which are worth considering, are not heard by many
people: this deprives people of critical information, and that is not
fair to the people involved.
Just look at this article for example; it is quite interesting and
well written, so why should it not be visible to everyone on the web?
Deleting and banning people who say things that are not comfortable
does not make you look balanced and trustworthy.
The Wikimedia Foundation should be able to stand up to such
accusations without resorting to gagging people; gagging just gives more
credit to the people being gagged and makes people wonder if there is
any merit in what they say.
This brings up my favorite subject of unneeded deletions versus needed
ones. Of course there is material that should be deleted that is hateful,
spam, etc.; let's call that evil content.
But the articles that I wrote and my friends wrote that were deleted
did not fall into that category; they might have been just bad or not
notable. We have had a constant struggle to keep our articles from being
deleted in a manner that we consider unfair. Additionally, the bad
content is lost and falls into the same category as evil content.
Also, there should be more transparency about deleted material on
Wikipedia itself; there is a lot of information that is being deleted
and gone forever without proper process or review.
In my eyes there is a connection between the two topics, the banning
of people and the deleting of information. Both deprive people
of information that they want and need, in an unfair manner.
Instead of articles about obscure events, things, and old places in
Kosovo, you have a Wikipedia full of the latest information about every
television show; is that what you really want?
I think there should be room for things and places that are not
notable because they are not part of mainstream pop culture; we also
need to support the underdogs of Wikipedia even if they are not
mainstream. Mr Kohs definitely has something to say and I would
like to hear it. And the Kosovars have something to say even if the
Serbs don't want to hear it. The Albanians have something to say even
if the Greeks don't want to hear it, etc. There are many cases of
people from Kosovo and Albania driven out of Wikipedia, depriving
the project of important information, because they are not able to get
started and their contributions are so far away from the dominating
political viewpoint of the opposite side that they don't even get a
chance to be heard.
We need to make a way for these people to be heard and to moderate the
conflicts better; that will make Wikipedia stronger and more robust.
Should we offer to host Citizendium?

Okay, get over the instinctive reaction.

Those who have read this week's Signpost will be aware that
Citizendium is in significant financial difficulties. If not, see the
end of the briefly section:
Now I know we haven't exactly had the best of relationships with
Citizendium, but we are, if not distant allies, at least interested
observers. Their mission and much of their product at this time
coincide with ours.

We should offer to host Citizendium on our servers at no cost for a
period of 1 (one) year, offering a level of support equivalent to our
smaller projects. After one year the Citizendium community/Editorial
Council is expected to have sorted themselves out to the point where
they can arrange their own hosting. At that point we lock the
database and provide them with the dumps.
*It is in line with our mission.
*It wouldn't cost very much. Given their traffic levels and database
size, the cost to host would probably be lower than that of some of our
more prolific image uploaders.
*It would be possible to effectively give them instacommons.
*Citizendium is an interesting project and gives us a way to learn
what the likely outcome of some alternative approaches would be.
*It helps with positioning the WMF as more than just Wikipedia.
*It prevents the Citizendium project from dying, which, since they have
useful content, would be unfortunate.
*They may still be on PostgreSQL rather than MySQL, which could create
compatibility issues.
*Some of their community are people banned from Wikipedia.
*Risk of looking like triumphalism over Larry (can be addressed by
making sure Jimbo is in no way involved).
*Keeping control of the relationship between the Citizendium
community/Editorial Council and the various WMF communities.
*Handing the password database back at the end of the year would need
to be done with care.
All in all, assuming we can deal with the database issue, I think it is
something we should do. The Citizendium community/Editorial Council may
well say no, but at least we will have offered.
This morning the Wikimedia Foundation had a meeting about migrating to
Google Apps. Google Apps is a web-based, closed-source office suite that
includes Gmail and a few other services.
I had a few questions about this migration.
Has the decision to use Google Apps been finalized? If so, who made the
decision?
What are the benefits of using Google Apps for the Wikimedia Foundation?
Is there a concern about using closed source software when there are
comparable open source alternatives?
Is there a concern that this will bring Google and the Wikimedia Foundation
closer together? After a $2 million grant, I imagine some people looking in
from the outside have their concerns about a takeover.
Are there concerns about Google's privacy practices? It doesn't seem
particularly wise to hand them all of your e-mail, especially if they
possibly have a business interest.
Any clarifications on this would be great!