This paper (the first reference below) is the result of a class project I was part of almost two years ago for CSCI 5417 Information Retrieval Systems. It builds on a class project I did in CSCI 5832 Natural Language Processing, which I presented at Wikimania '07. The project ran very late; we didn't send the final paper in until the day before New Year's. This technical report was never really announced that I recall, so I thought it would be interesting to look briefly at the results. The goal of the paper was to break articles down into surface features and latent features, and then use those to study the rating system being used, predict article quality, and rank results in a search engine. We used the [[random forests]] classifier, which allowed us to analyze the contribution of each feature to performance by looking directly at the weights that were assigned. While the surface analysis was performed on the whole English Wikipedia, the latent analysis was performed on the Simple English Wikipedia (it is more expensive to compute).

= Surface features =

* Readability measures are the single best predictor of quality, as defined by the Wikipedia Editorial Team (WET), that I have found. The [[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed by length of the article HTML, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number of internal links, [[Laesbarhedsindex Readability Formula]], number of words, and number of references. Weakly predictive were: number of "to be"s, number of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number of external links, and number of relative links. Not predictive (overall; see the end of section 2 for the per-rating score breakdown): number of h2s or h3s, number of conjunctions, number of images*, average word length, number of h4s, number of prepositions, number of pronouns, number of interlanguage links, average syllables per word, number of nominalizations, article age (based on page id), proportion of questions, and average sentence length.
:* Number of images was actually by far the single strongest predictor of any class, but only for Featured articles. Because it was so good at picking out Featured articles, and somewhat good at picking out A and G articles, the classifier was confused in so many cases that the overall contribution of this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high predictive value assigned to many features. The Featured class is also very distinctive. F, B and S (Stop/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.

= Latent features =

The algorithm used for latent analysis, which is an
analysis of the occurrence of words in every document with respect to the
link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
Allocation]]. This part of the analysis was done by CS PhD student Praful
Mangalath. An example of what can be done with the result of this analysis: you provide a word (a search query) such as "hippie", and you can then look at the weight of every article for that word. You can pick the article with the largest weight and look at its link network, picking out the articles that it links to and/or that link to it which are also weighted strongly for the word hippie, while also contributing maximally to this article's "hippieness". We tried this query in our system (LDA), in Google (site:en.wikipedia.org hippie), and in the Simple English Wikipedia's Lucene search engine. The breakdown of articles occurring in the top ten search results for this word for those engines is:

* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]], [[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]], [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]].
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a democratic society]], [[Woodstock festival]].
* LDA & Google: [[Psychedelic Pop]].
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]].

(See the paper for the articles produced for the keywords philosophy and economics.)

= Discussion / Conclusion =

* The results of the latent analysis are open to interpretation, but what is interesting is that the LDA features predict the WET
ratings of quality just as well as the surface-level features. Both feature sets (surface and latent) pull out almost all of the information that the rating system bears.
* The rating system devised by the WET is not
distinctive. At best you can tell the difference between Featured, A and Good articles (grouped together) versus B articles. Featured, A and Good articles are also quite distinctive (Figure 1). Note that in this study we didn't look at Starts and Stubs, but in an earlier paper we did.
:* This is interesting when compared to the recent entry on the YouTube blog, "Five Stars Dominate Ratings".
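As a side note on method: the per-class "signatures" above come from the weights a random-forest classifier assigns to its features. Here is a rough sketch of how such weights can be inspected, using scikit-learn on synthetic data. This is an illustration, not the paper's actual pipeline, and the feature names are invented stand-ins:

```python
# Toy illustration: reading per-feature "weights" off a trained random
# forest via scikit-learn's feature_importances_ attribute.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["ari", "fk_grade", "html_length", "num_images"]
X = rng.normal(size=(200, len(feature_names)))
# Synthetic labels driven mostly by the first and third features.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, w in sorted(zip(feature_names, clf.feature_importances_),
                      key=lambda t: -t[1]):
    print(f"{name:12s} {w:.3f}")
```

The importances sum to one, so features that carry no information about the label end up with weights near zero, which is how a feature like number of images can be "strong for one class" yet contribute nothing overall.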
I think a sane, well-researched (with actual subjects) rating system is well within the purview of the Usability Initiative. Helping people find and create good content is what Wikipedia is all about. Having a solid rating system allows you to reorganize the user interface, the Wikipedia namespace, and the main namespace around good content and bad content as needed. If you don't have a solid, information-bearing rating system, you don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles
and using that to train machines to automatically pick out good content. You
ask people questions along dimensions that make sense to people, and give
the machine access to other surface features (such as a statistical measure
of readability, or length) and latent features (such as can be derived from
document word occurrence and encyclopedia link structure). I referenced page
262 of Zen and the Art of Motorcycle Maintenance to give an example of the
kind of qualitative features I would ask people. It really depends on what
features end up bearing information, to be tested in "the lab". Each word is
an example dimension of quality: We have "*unity, vividness, authority,
economy, sensitivity, clarity, emphasis, flow, suspense, brilliance,
precision, proportion, depth and so on.*" You then use surface and latent
features to predict these values for all articles. You can also say, when a
person rates this article as high on the x scale, they also mean that it has this much of these surface and these latent features.
= References =
- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
Wikipedia through quality and concept discovery*. Technical Report.
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
feasibility of automatically rating online article quality*. Technical
Report.
I have asked and received permission to forward to you all this most
excellent bit of news.
The LINGUIST List is a most excellent resource for people interested in the
field of linguistics. As I mentioned some time ago, they held a funding
drive in which they asked for a certain amount of money within a given
number of days; in return they would run a project on Wikipedia to
learn what needs doing to get better coverage for the field of linguistics.
What you will read in this mail is that the entire community of linguists is
asked to cooperate. I am really thrilled, as it will also get us more
linguists interested in what we do. My hope is that a fraction will be
interested in the languages that they care for and help them become more
relevant. As a member of the "language prevention committee", I love to get
more knowledgeable people involved in our smaller projects. If it means that
we get more requests for more projects, we will really feel embarrassed by
all the new projects we will have to approve because of the quality of the
Incubator content and the quality of the linguistic arguments for why we
should approve yet another language :)
NB: Is this not a really clever way of raising money? Give us this much in
this time frame and we will then do this as a bonus...
---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
LINGUIST List: Vol-18-1831. Mon Jun 18 2007. ISSN: 1068 - 4875.
Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
As you may recall, one of our Fund Drive 2007 campaigns was called the
"Wikipedia Update Vote." We asked our viewers to consider earmarking their
donations to organize an update project on linguistics entries in the
English-language Wikipedia. You can find more background information on this
The speed with which we met our goal, thanks to the interest and generosity
of our readers, was a sure sign that the linguistics community was enthusiastic
about the idea. Now that summer is upon us, and some of you may have a bit of
leisure time, we are hoping that you will be able to help us get started on the
Wikipedia project. The LINGUIST List's role in this project is a purely
organizational one. We will:
*Help, with your input, to identify major gaps in the Wikipedia materials or
pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified as
"in need of attention from an expert on the subject" or "does not cite any
references or sources," etc.;
*Send out periodic calls for volunteer contributors on specific topics or pages;
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with the Wikimedia Foundation to publicize the linguistics community's efforts.
We hope you are as enthusiastic about this effort as we are. Just to help us
get started looking at Wikipedia more critically, and to easily identify entries
needing improvement, we suggest that you take a look at the List of linguists.
Many people are not listed there; others need to have more facts added. If you
would like to participate in this exciting update effort, respond by sending an
email to LINGUIST Editor Hannah Morales at hannah(a)linguistlist.org, suggesting
what your role might be or which entries you feel should be updated or added.
Some linguists who saw our announcement on the Internet have already written us
with specific suggestions, which we will share with you soon.
This update project will take major time and effort on all our parts. The
result will be a much richer internet resource of information on the breadth
and depth of the field of linguistics. Our efforts should also stimulate
students to consider studying linguistics and help educate a wider public on
what we do. Please consider participating.
Editor, Wikipedia Update Project
Linguistic Field(s): Not Applicable
There is a request for a Wikipedia in Ancient Greek. This request has so far
been denied. A lot has been written about it. Many people maintain
their positions and do not, for whatever reason, consider the arguments of
others. In my opinion there are a few roadblocks:
- Ancient Greek is an ancient language; the policy does not allow for ancient languages.
- Text in ancient Greek written today about contemporary subjects
requires the reconstruction of Ancient Greek.
- it requires the use of existing words for concepts that did
not exist at the time when the language was alive
- neologisms will be needed to describe things that did not
exist at the time when the language was alive
- modern texts will not represent the language as it used to be
- Constructed, and by inference reconstructed, languages are effectively not permitted.
We can change the policy if there are sufficient arguments and we agree on the criteria.
When a text is written in reconstructed ancient Greek, and when it is
clearly stated that it is NOT the ancient Greek of bygone days, it can be
obvious that it is a great tool to learn skills to read and write ancient
Greek but that it is in itself not Ancient Greek. Ancient Greek as a
language is ancient. I have had a word with people who are involved in the
working group that deals with ISO-639, and with someone from SIL, and it is
clear that a proposal for a code for "Ancient Greek reconstructed" will be
considered for ISO-639-3. For ISO-639-6 a code is likely to be granted,
because a clear use for it can be shown. We can apply for a code, and as it
has a use bigger than Wikipedia alone, it clearly has merit.
With modern texts clearly labelled as distinct from the original language,
it will be obvious that the innovations a writer needs for his writing are acceptable.
This leaves the fact that constructed and reconstructed languages are not
permitted because of the notion that mother tongue users are required. In my
opinion, this has always been only a gesture to those people who are dead
set against any and all constructed languages. In the policies there is
something vague "*it must have a reasonable degree of recognition as
determined by discussion (this requirement is being discussed by the language
subcommittee <http://meta.wikimedia.org/wiki/Language_subcommittee>)."* It
is vague because even though the policy talks about a discussion, it is
killed off immediately by stating "The proposal has a sufficient number of
living native speakers to form a viable community and audience." In my
opinion, this discussion for criteria for the acceptance of constructed or
reconstructed languages has not happened. Proposals for objective criteria
have been ignored.
In essence, to be clear about it:
- We can get a code for reconstructed languages.
- We need to change the policy to allow for reconstructed and constructed languages.
We need to do both in order to move forward.
The proposal for objective criteria for constructed and reconstructed
languages is in a nutshell:
- The language must have an ISO-639-3 code
- We need full WMF localisation from the start
- The language must be sufficiently expressive for writing a modern encyclopedia.
- The Incubator project must have sufficiently large articles that
demonstrate both the language and its ability to write about a wide range of topics.
- A sufficiently large group of editors must be part of the Incubator project.
Let's see what we've got here:
A "Board" that appears answerable only to some god; an "Executive Director"
who answers only to this "Board"; a group of "Moderators" who claim (with a
straight face) that they are "independent", but whose "moderations" are
clearly designed to keep the first two in a favorable light; and, dead last,
you have the people who, not so ironically, create the substance of the
thing that makes the first three possible. This setup sounds achingly
familiar. And, like all similar setups throughout history, is set up to
on 10/20/10 12:44 AM, Virgilio A. P. Machado at vam(a)fct.unl.pt wrote:
> I agree with you. You raised some very good points.
> Virgilio A. P. Machado
> At 03:47 20-10-2010, you wrote:
>> ________________________________ From: Austin
>> Hair <adhair(a)gmail.com> To: Wikimedia Foundation
>> Mailing List <foundation-l(a)lists.wikimedia.org>
>> Sent: Tue, October 19, 2010 12:35:07 PM Subject:
>> Re: [Foundation-l] Greg Kohs and Peter Damian On
>> Mon, Oct 18, 2010 at 6:40 PM, Nathan
>> <nawrich(a)gmail.com> wrote: > If it pleases the
>> moderators, might we know on what basis Greg
>> was > banned and Peter indefinitely muzzled?
>> Greg Kohs was banned for the same reason that
>> he's been on moderation for the better part of
>> the past year, namely that he was completely
>> unable to keep his contributions civil, and
>> caused more flamewars than constructive
>> discussion. Peter Damian is only on moderation,
>> and we'll follow our usual policy of letting
>> through anything that could be considered even
>> marginally acceptable. We really are very
>> liberal about this; otherwise you wouldn't have
>> heard from Mr. Kohs at all in the past six
>> months. I'm sure that my saying this won't
>> convince anyone who's currently defending him,
>> but nothing about the decision to ban Greg Kohs
>> was retaliatory. I'll also (not for the first
>> time) remind everyone that neither the Wikimedia
>> Foundation Board, nor its staff, nor any chapter
>> or other organizational body has any say in the
>> administration of this list. I hope that clears
>> up all of the questions asked in this thread so
>> far. It is not about defending anyone but about
>> the fact that the "I know bannable when I see
>> it" theory of moderation is unconstructive and
>> leads to dramafests. The next ban is the one
>> that will likely cause a real flame war. I
>> suspect *more* people would be on moderation if
>> any sort of objective criteria were being
>> used. The lack of explanation over this bothers
>> me so much because I suspect that you *can't*
>> explain it. It seems to be the sort of gut-shot
>> that hasn't been thought through. Moderate more
>> people based on real criteria, rather than how
>> you feel about them. Birgitte
From what I have seen of Greg Kohs, he does have some
interesting points to make, but I do see that he is jumping to
conclusions and seems to have a biased viewpoint.
People want to make their own decisions and to have enough information to
do that. We don't want important information deleted just
because it is uncomfortable.
Banning him makes it less likely for him to be heard, and these
interesting points, which are worth considering, are not heard by many
people: this deprives people of critical information, which is not
fair to the people involved.
Just look at this article for example; it is quite interesting and
well written, so why should it not be visible to everyone on the web?
Deleting and banning people who say things that are not comfortable
does not make you look balanced and trustworthy.
The Wikimedia foundation should be able to stand up to such
accusations without resorting to gagging people, it just gives more
credit to the people being gagged and makes people wonder if there is
any merit in what they say.
This brings up my favorite subject: unneeded deletions versus needed ones.
Of course there is material that should be deleted because it is hateful,
spam, etc.; let's call that evil content.
But the articles that I wrote and my friends wrote that were deleted
did not fall into that category; they might have been just bad or not notable.
We have had a constant struggle to keep our articles from being
deleted in a manner that we consider unfair. Additionally, the bad
content is lost and falls into the same category as evil content.
Also, there should be more transparency about deleted material on
Wikipedia itself; a lot of information is being deleted
and lost forever without proper process or review.
In my eyes there is a connection between the two topics, the banning
of people and the deleting of information. Both deprive people
of information that they want and need, in an unfair manner.
Instead of articles about obscure events, things, and old places in
Kosovo, you have a Wikipedia full of the latest information about every
television show. Is that what you really want?
I think there should be room for things and places that are not
notable because they are not part of mainstream pop culture. We also
need to support the underdogs of Wikipedia even if they are not
mainstream. Mr Kohs definitely has something to say and I would
like to hear it. And the Kosovars have something to say even if the
Serbs don't want to hear it. The Albanians have something to say even
if the Greeks don't want to hear it, etc. There are many cases of
people from Kosovo and Albania driven out of Wikipedia, depriving
the project of important information, because they are not able to get
started and their contributions are so far away from the dominating
political viewpoint of the opposite side that they don't even get a
chance to be heard.
We need to make a way for these people to be heard and to moderate the
conflicts better, that will make Wikipedia stronger and more robust.
[crossposted to foundation-l and wikitech-l]
"There has to be a vision though, of something better. Maybe something
that is an actual wiki, quick and easy, rather than the template
coding hell Wikipedia's turned into." - something Fred Bauder just
said on wikien-l.
Our current markup is one of our biggest barriers to participation.
AIUI, edit rates are about half what they were in 2005, even as our
fame has gone from "popular" through "famous" to "part of the
structure of the world." I submit that this is not a good or healthy
thing in any way and needs fixing.
People who can handle wikitext really just do not understand how
off-putting the computer guacamole is to people who can cope only with
text they can see.
We know this is a problem; WYSIWYG that works is something that's been
wanted here forever. There are various hideous technical nightmares in
its way, that make this a big and hairy problem, of the sort where the
hair has hair.
However, I submit that it's important enough we need to attack it with
actual resources anyway.
This is just one data point, where a Canadian government office got
*EIGHT TIMES* the participation in their intranet wiki by putting in a
(heavily locally patched) copy of FCKeditor:
"I have to disagree with you given my experience. In one government
department where MediaWiki was installed we saw the active user base
spike from about 1000 users to about 8000 users within a month of having
enabled FCKeditor. FCKeditor definitely has its warts, but it very
closely matches the experience non-technical people have gotten used to
while using Word or WordPerfect. Leveraging skills people already have
cuts down on training costs and allows them to be productive almost immediately.
"Since a plethora of intelligent people with no desire to learn WikiCode
can now add content, the quality of posts has been in line with the
adoption of wiki use by these people. Thus one would say it has gone up.
"In the beginning there were some hard core users that learned WikiCode,
for the most part they have indicated that when the WYSIWYG fails, they
are able to switch to WikiCode mode to address the problem. This usually
occurs with complex table nesting which is something that few of the
users do anyways. Most document layouts are kept simple. Additionally,
we have a multilingual English/French wiki. As a result the browser
spell-check is insufficient for the most part (not to mention it has
issues with WikiCode). To address this, a second spellcheck button was
added to the interface so that both English and French spellcheck could
be available within the same interface (via aspell backend)."
So, the payoffs could be ridiculously huge: eight times the number of
smart and knowledgeable people even being able to *fix typos* on
material they care about.
Here are some problems. (Off the top of my head; please do add more,
all you can think of.)
- The problem:
* Fidelity with the existing body of wikitext. No conversion flag day.
The current body exploits every possible edge case in the regular
expression guacamole we call a "parser". Tim said a few years ago that
any solution has to account for the existing body of text.
* Two-way fidelity. Those who know wikitext will demand to keep it and
will bitterly resist any attempt to take it away from them.
* FCKeditor (now CKeditor) in MediaWiki is all but unmaintained.
* There is no specification for wikitext. Well, there almost is:
compiled as C, it runs a bit slower than the existing PHP parser.
But it's a start!
- Attempting to solve it:
* The best brains around Wikipedia, MediaWiki and WMF have dashed
their foreheads against this problem for at least the past five years
and have got *nowhere*. Tim has a whole section in the SVN repository
for "new parser attempts". Sheer brilliance isn't going to solve this
* Tim doesn't scale. Most of our other technical people don't scale.
*We have no resources and still run on almost nothing*.
($14m might sound like enough money to run a popular website, but for
comparison, I work as a sysadmin at a tiny, tiny publishing company
with more money and staff just in our department than that, to do
*almost nothing* compared to what WMF achieves. WMF is an INCREDIBLY
efficient organisation.)
- Other attempts:
* Starting from a clear field makes it ridiculously easy. The
government example quoted above is one. Wikia wrote a good WYSIWYG
that works really nicely on new wikis (I'm speaking here as an
experienced wikitext user who happily fixes random typos on Wikia). Of
course, I noted that we can't start from a clear field - we have an
existing body of wikitext.
So, specification of the problem:
* We need good WYSIWYG. The government example suggests that a simple
word-processor-like interface would be enough to give tremendous results.
* It needs two-way fidelity with almost all existing wikitext.
* We can't throw away existing wikitext, much as we'd love to.
* It's going to cost money in programming the WYSIWYG.
* It's going to cost money in rationalising existing wikitext, so that
the most unfeasible formations can be shunted off to legacy handling.
* It's going to cost money in usability testing and so on.
* It's going to cost money for all sorts of things I haven't even
thought of yet.
This is a problem that would pay off hugely to solve, and that will
take actual money thrown at it.
How would you attack this problem, given actual resources for grunt work?
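To make the "regular expression guacamole" point concrete, here is a toy converter (hypothetical, not MediaWiki code) that handles the easy cases and is immediately defeated by a classic wikitext edge case, the five-apostrophe bold-italic:

```python
import re

def naive_wikitext(line):
    """Toy wikitext-to-HTML conversion: bold, italics, internal links."""
    line = re.sub(r"'''(.+?)'''", r"<b>\1</b>", line)   # '''bold'''
    line = re.sub(r"''(.+?)''", r"<i>\1</i>", line)     # ''italic''
    line = re.sub(r"\[\[(.+?)\]\]", r'<a href="/wiki/\1">\1</a>', line)
    return line

# The easy cases work:
print(naive_wikitext("'''bold''', ''italic'' and a [[Hippie]] link"))
# But '''''bold italic''''' produces mis-nested tags (<b><i>...</b></i>),
# because the bold pass eats three of the five apostrophes on each side:
print(naive_wikitext("'''''bold italic'''''"))
```

Every real page exercises dozens of interactions like this (apostrophes inside links, templates inside tables), which is why round-tripping the existing body of wikitext through a WYSIWYG editor is so hard.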
Most of the templates in our project, imho, are just more clutter.
The number of people who know how to use any particular template can
probably be counted with a box of marbles. When others see the
templates, they just shy away; they don't bother to try to learn them.
If we want to make things easier for editors, we should scrap templates
entirely. What they add to the project is not worth what they detract.
As we prepare to ring in the New Year, we're happy to announce that each
of our four current Wikimedia Foundation Board members up for
reappointment has been unanimously reappointed and will continue in
their position in 2011. In addition, the Board has completed its first
ever evaluation process, and during that time, we thought hard about
some of the details around the appointment process. After completing the
evaluation process we've voted to adjust the bylaws to extend all
Trustee appointments to two years.
In the past, community-elected and chapter-nominated trustees held
positions for two years before reelection or renomination was necessary;
however, Board-appointed Trustees and the Community Founder Trustee
needed to be reappointed every year. We've decided to amend Article 4,
Section 3 of the Wikimedia Foundation's bylaws so that all Trustees are
equal. We recognize that it's important to all of us that all terms are
equal no matter what "seat" a member holds. This amendment will
officially begin January 1, 2011; however, Stu, Jan-Bart, and Jimmy have
volunteered to take on a one-year term. Their terms will expire on
December 31, 2011. Matt's and Bishaka's terms will expire on December 31, 2012.
As written in the bylaws, the Board of Trustees consists of three
members selected by the Wikimedia Community, two selected by Wikimedia
chapters, four members through appointment by the Board for specific
expertise, and a Community Founder seat. We currently remain a full
Board with all 10 Trustees currently serving.
Looking back at the last year, we've accomplished a lot as a Board and
as a movement. 2011 will prove to be both exciting and challenging as we
really move into implementing the five-year strategic plan we all worked
so hard on together this year. We're looking forward to bringing
Wikimedia to more people in more places all over the world. Thank you
all for your commitment to keeping Wikimedia going, and happy 10 years
and happy 2011 to you all.
Member of the Board of Trustees
Wikimedia Foundation, Inc.