This paper (first reference) is the result of a class project I was part of
almost two years ago for CSCI 5417 Information Retrieval Systems. It builds
on a class project I did in CSCI 5832 Natural Language Processing, which
I presented at Wikimania '07. The project was very late, as we didn't send
the final paper in until the day before New Year's. As far as I recall, this
technical report was never really announced, so I thought it would be interesting to
look briefly at the results. The goal of this paper was to break articles
down into surface features and latent features and then use those to study
the rating system being used, predict article quality and rank results in a
search engine. We used the [[random forests]] classifier, which allowed us to
analyze the contribution of each feature to performance by looking directly
at the weights that were assigned. While the surface analysis was performed
on the whole English Wikipedia, the latent analysis was performed on the
Simple English Wikipedia (it is more expensive to compute).
= Surface features =
* Readability measures are the single best predictor of quality, as defined by the Wikipedia Editorial Team (WET), that I have found. The [[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed by length of article HTML, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number of internal links, [[Laesbarhedsindex Readability Formula]], number of words and number of references. Weakly predictive were number of "to be" verbs, number of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number of external links and number of relative links. Not predictive (overall; see the end of section 2 for the per-rating score breakdown): number of h2 or h3 headings, number of conjunctions, number of images*, average word length, number of h4 headings, number of prepositions, number of pronouns, number of interlanguage links, average syllables per word, number of nominalizations, article age (based on page id), proportion of questions, average sentence length.
:* Number of images was actually by far the single strongest predictor of any class, but only for Featured articles. Because it was so good at picking out Featured articles and somewhat good at picking out A and G articles, the classifier was confused in so many cases that the overall contribution of this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high predictive value assigned to many features. The Featured class is also very distinctive. F, B and S (Start/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.
= Latent features =
The algorithm used for latent analysis, which is an
analysis of the occurrence of words in every document with respect to the
link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
Allocation]] (LDA). This part of the analysis was done by CS PhD student Praful
Mangalath. As an example of what can be done with the result of this analysis,
you provide a word (a search query) such as "hippie". You can then
look at the weight of every article for the word "hippie", pick the
article with the largest weight, and look at its link network. From that network you can
pick out the articles that this article links to, and/or which link to this
article, that are also weighted strongly for the word "hippie" while also
contributing maximally to this article's "hippieness". We tried this query in
our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple
English Wikipedia's Lucene search engine. The breakdown of articles occurring
in the top ten search results for this word for those engines is:
* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]], [[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]], [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]
(See the paper for the articles produced for the keywords "philosophy" and "economics".)
= Discussion / Conclusion =
* The results of the latent analysis are totally up to your perception. But what is interesting is that the LDA features predict the WET ratings of quality just as well as the surface-level features. Both feature sets (surface and latent) pull out almost all of the information that the rating system bears.
* The rating system devised by the WET is not distinctive. You can best tell the difference between Featured, A and Good articles (grouped together) and B articles. Featured, A and Good articles are also quite distinctive (Figure 1). Note that in this study we didn't look at Starts and Stubs, but in an earlier paper we did.
:* This is interesting when compared to this recent entry on the YouTube blog, "Five Stars Dominate Ratings".
I think a sane, well-researched (with actual subjects) rating system is
well within the purview of the Usability Initiative. Helping people find and
create good content is what Wikipedia is all about. Having a solid rating
system allows you to reorganize the user interface, the Wikipedia
namespace, and the main namespace around good content and bad content as
needed. If you don't have a solid, information-bearing rating system, you
don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles
and using that to train machines to automatically pick out good content. You
ask people questions along dimensions that make sense to people, and give
the machine access to other surface features (such as a statistical measure
of readability, or length) and latent features (such as can be derived from
document word occurrence and encyclopedia link structure). I referenced page
262 of Zen and the Art of Motorcycle Maintenance to give an example of the
kind of qualitative features I would ask people about. Which ones to use really depends on what
features end up bearing information, to be tested in "the lab". Each word is
an example dimension of quality: We have "*unity, vividness, authority,
economy, sensitivity, clarity, emphasis, flow, suspense, brilliance,
precision, proportion, depth and so on.*" You then use surface and latent
features to predict these values for all articles. You can also say that when a
person rates an article as high on the x scale, they also mean that it
has this much of these surface and these latent features.
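To make "surface feature" a little more concrete, here is a minimal Python sketch that computes the Automated Readability Index together with a few simple counts for a piece of plain text. The ARI formula is the standard published one; the function name `surface_features`, the regular-expression tokenization and the choice of returned fields are assumptions made for this illustration and are not the feature extraction used in the paper.

```python
import re


def surface_features(text: str) -> dict:
    """Compute a few illustrative surface features for plain text.

    Hypothetical helper for illustration only: the tokenization is a
    simplification, not the feature extraction used in the paper.
    """
    # Words: runs of letters or digits (apostrophes and hyphens split words).
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Sentences: non-empty chunks between '.', '!' or '?' characters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")

    num_chars = sum(len(w) for w in words)
    chars_per_word = num_chars / len(words)
    words_per_sentence = len(words) / len(sentences)

    # Automated Readability Index (standard formula).
    ari = 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43

    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_word_length": chars_per_word,
        "avg_sentence_length": words_per_sentence,
        "ari": ari,
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    for name, value in surface_features(sample).items():
        print(f"{name}: {value:.2f}")
```

A feature dictionary like this, computed per article, is the kind of input a classifier could be trained on alongside human ratings; a real pipeline would first need to strip wiki markup or HTML before counting.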
= References =
- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
Wikipedia through quality and concept discovery*. Technical Report.
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
feasibility of automatically rating online article quality*. Technical Report.
I have asked and received permission to forward to you all this most
excellent bit of news.
The LINGUIST List is a most excellent resource for people interested in the
field of linguistics. As I mentioned some time ago, they held a funding
drive in which they asked for a certain amount of money within
a given number of days; if that goal was met, they would start a project on Wikipedia to
learn what needs doing to get better coverage for the field of linguistics.
What you will read in this mail is that the whole community of linguists is
asked to cooperate. I am really thrilled, as it will also get more
linguists interested in what we do. My hope is that a fraction of them will be
interested in the languages that they care for and help them become more
relevant. As a member of the "language prevention committee", I love to get
more knowledgeable people involved in our smaller projects. If it means that
we get more requests for more projects we will really feel embarrassed with
all the new projects we will have to approve because of the quality of the
Incubator content and the quality of the linguistic arguments why we should
approve yet another language :)
NB Is this not a really clever way of raising money: give us this much in
this time frame and we will then do this as a bonus...?
---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
LINGUIST List: Vol-18-1831. Mon Jun 18 2007. ISSN: 1068 - 4875.
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
To post to LINGUIST, use our convenient web form at
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
As you may recall, one of our Fund Drive 2007 campaigns was called the
"Wikipedia Update Vote." We asked our viewers to consider earmarking their
donations to organize an update project on linguistics entries in the
English-language Wikipedia. You can find more background information on this
The speed with which we met our goal, thanks to the interest and generosity of
our readers, was a sure sign that the linguistics community was enthusiastic
about the idea. Now that summer is upon us, and some of you may have a bit of
leisure time, we are hoping that you will be able to help us get started on the
Wikipedia project. The LINGUIST List's role in this project is a purely
organizational one. We will:
*Help, with your input, to identify major gaps in the Wikipedia materials or
pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified as
"in need of attention from an expert on the subject" or "does not cite any
references or sources," etc.;
*Send out periodical calls for volunteer contributors on specific topics or
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with Wikimedia Foundation to publicize the linguistics community's
We hope you are as enthusiastic about this effort as we are. Just to help us
get started looking at Wikipedia more critically, and to easily identify an
needing improvement, we suggest that you take a look at the List of
Many people are not listed there; others need to have more facts and
added. If you would like to participate in this exciting update effort,
please respond by sending an email to LINGUIST Editor Hannah Morales at
hannah(a)linguistlist.org, suggesting what your role might be or which
entries you feel should be updated or added. Some linguists who saw our
on the Internet have already written us with specific suggestions, which we will
share with you soon.
This update project will take major time and effort on all our parts. The
result will be a much richer internet resource of information on the breadth and
depth of the field of linguistics. Our efforts should also stimulate
students to consider studying linguistics and educate a wider public on what
we do. Please consider participating.
Editor, Wikipedia Update Project
Linguistic Field(s): Not Applicable
LINGUIST List: Vol-18-1831
I was [[:m:User:555]], mainly active, in the last years of my volunteer
work, on Wikimedia Commons and Wikisource. I've left the Wikimedia
projects mainly because I lack the energy to keep trying to
find free time to work on projects fully neglected by the Wikimedia staff,
the developer team and some volunteers at the core of the Foundation's activities.
A friend told me about the http://labs.wikimedia.beta.wmflabs.org/ . I've
surprise! No Wikisource wikis with blue links! I asked myself
random things about [[bug:21653]], which lasted for 26 months until it got
PARTIALLY fixed, and decided to check some 'Recent changes' pages and found
Come on guys! What is the point of running a bot that spams all wikis if the
tests are only for the Wikipedias? Is it an attempt at a 'politically correct' action
so that these worse guys from other projects get 'socially included'? Like in
real life, those worse guys aren't in need of assistencialism
Well, I don't expect any change in the Wikipediocentric actions in the short,
medium or long term (in fact the Foundation and some local chapters are
trying to do things for the Wikimedia Commons project, but only because
that project is the central media source for the Wikipedias); this was only a
Despite the apparent hatred in this message, I really hope that the 3-4
extensions only enabled on Wikisource wikis don't get any more bugs
than the current ones in the new version of MediaWiki, with the same intensity
that you guys hope that focusing on a project that only describes
knowledge in an encyclopedic way fully meets the
 - wow, a concept from social sciences not yet defined on either
en.wikipedia or en.wiktionary? O_O
As in all of my previous messages, sorry for my limited English skills.
On 13.04.2012 at 13:01, wikimedia-l-request(a)lists.wikimedia.org wrote:
> please find below the WMF report for March 2012, in plain text.
Thanks for publishing the new report.
> For a few months now, we have been publishing a separate "Highlights"
> summary. Please consider helping non-English-language communities to
> stay updated, by providing a translation:
> Many thanks to those who translated last month's "Highlights" into
> Danish, German, Spanish, French, Italian, Japanese, Dutch and
> (partially) Arabic.
We have discussed this on the German Chapter's list recently. Most of those taking part in the discussion were of the opinion that the Wikimedia Foundation should provide translations of its documents into the most important languages. We touched upon the subject as WCA announced it will publish its reports in several languages. Translations should not be left to the community. It is not up to the community to get news from the Foundation; rather, it is up to the Foundation to get its message across to the community. Please note that only a minority of Wikipedians are able to understand your documents published in English. I would be quite grateful if we could please change this.
[ Please excuse me if the subject has already been beaten to
death here; I am not a regular visitor to this mailing list.
I tried to search for this stuff here & on strategywiki, but
feel free to point me to the archives! ]
I recently researched some material related to a recent catastrophic
event in Polish railway history and I found out that the volunteers
who traditionally dealt with railway matters on Polish Wikipedia
have virtually disappeared.
I remember that community being strong a few years ago, and now we
found out that even some basic information about infrastructure is
The few people who still maintain that stuff on the Polish Wikipedia
showed me that at least two other MediaWiki-based projects have been
started to fill the gap:  The latter even greets you with a very
nice shot of *the* railway junction that was instrumental in a recent
One of the projects got started by experienced Wikipedia
editors. They still copy some of their content to the Polish
Wikipedia, but only after it matures; I asked them about their
reasons for going outside of Wikipedia and they said:
* They have to do lots of original research; it is impossible
to follow development of the railway infrastructure and
operations using only high quality published sources;
* They got bitten a bit by the "notability" discussions in their
field; they want to document every track, every junction
and every locomotive and they are tired of discussing
how "notable" a particular piece of railway equipment
I would have said it's just a single case, but I've seen
some successful web portals being launched by people interested
in history; what is different from many history research and
fan pages is that I've also seen some active members of Wikipedia
community becoming more and more active on those independent sites.
It might be that (unproven theory) really valuable authors
are living on the verge of original research; at some point
they might prefer to move over to independent sites.
There may be other factors too: a smaller, friendlier community;
the possibility to start anew and so on.
As a few of those sites are using MediaWiki software, I started
to call them "pre-wikis". Some of them might become a sort of
"waiting room" for content to be published
on the "mature" Wikipedia. To me, the analogy to the Wikipedia-Nupedia
story is striking.
What's interesting is that people are not afraid to use
MediaWiki *again* (with all its well-known deficiencies).
In general, I think this is nothing new. There are thousands
of fan wikis on places like Wikia, where certainly some
contributors copy over some mature content to Wikipedia,
should licensing allow that.
But maybe there is some trend that could probably be
better researched, and here are my questions to you:
(1) Do you see a similar trend in your respective communities
(preferably not only English-speaking ones)?
(2) Is there a legitimate need for multi-tiered
development of knowledge-related content (test
wikis, "pre-wikis", sighted revisions) or shall we pursue
the "flat development space" ideal?
(3) Assuming we find the above-mentioned trend to be
generally a good thing, shouldn't we try to research
some methodologies to find out whether there is a sizeable
effort supporting our goals outside of the core Wikimedia
(4) Assuming we don't like what's going on, shouldn't
we revisit some of Wikipedia's core values (like "no
original research", but not only) and try to address
the issue there?
(5) Has Wikipedia as a "product" achieved some
maturity in a way that the real growth and innovation needs
to go somewhere else, as no product/project lasts forever?
Maybe it's something around the question that Kim Bruning
asked on strategywiki  and also :
"we need to find some way to infuse new life
into wikis that are coming to the end of the
WikiLifeCycle. Wiki-communities can, do and will
blow up, and we need to learn how to prevent it,
or have plans on what to do and how to pick up the
User:Saper from plwiki
The voting could be carried out together with the global event. Vote eligibility
could be tied to participating in the events, for example.
Of course not every country will have an event, so I am not sure if this approach
is a good one.
-- とある白い猫 (To Aru Shiroi Neko)
2012/4/18 Tadija Miletić <atnimnjin(a)hotmail.com>
> Hmm, and maybe something where we can invite more people to Wikipedia?
> Billboards with :
> Join decade of knowledge. Participate! Write new article!
> Or something similar. I am also against POT-DEC; it is a poor thing for a big global
> event such as this.
> *From:* Gnangarra <gnangarra(a)gmail.com>
> *Sent:* Wednesday, April 18, 2012 9:45 AM
> *To:* Wikimedia Commons Discussion List <commons-l(a)lists.wikimedia.org>
> *Cc:* smolensk(a)eunet.rs ; Wikimedia Mailing List<wikimedia-l(a)lists.wikimedia.org>
> *Subject:* Re: [Commons-l] [Wikimedia-l] [Commons-POTY-l] 10th
> anniversary of Wikimedia Commons
> Why not a worldwide Wikitakes or a Photowalk day? That way everyone
> everywhere can participate in it, with no need for big off-Commons organisation.
> 2012/4/18 とある白い猫 <to.aru.shiroi.neko(a)gmail.com>
>> I do not think we want to select POT-DEC (let's not call it POTD which is
>> something else :) ) from older POTYs since we don't have a large number to
>> choose from. Also, it would be very boring to re-nominate the same winner
>> again. If anything existing POTY winners perhaps should be disqualified for
>> this reason.
>> I am not too sure which procedure would be best, to be honest. I hope
>> this discussion will determine that very aspect. :)
>> US GLAM is appealing but we do want something global. Certainly US GLAM
>> partnerships should be part of it but they should not be all of it.
>> Wiki Loves Monuments was a good precursor to this kind of activity. Perhaps
>> a kind of "lessons learned" assessment may be useful while working on this.
>> -- とある白い猫 (To Aru Shiroi Neko)
>> On Wed, Apr 18, 2012 at 07:31, Nikola Smolenski <smolensk(a)eunet.rs> wrote:
>>> > > 2012/4/17 とある白い猫 <to.aru.shiroi.neko(a)gmail.com>:
>>> > > > We already have POTY as an annual event so perhaps a "decade" event could
>>> > > > be something interesting to consider.
>>> The obvious: select POTD from all the POTYs :)
>>> Wikimedia-l mailing list
>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>> Commons-l mailing list
> Photo Gallery: http://gnangarra.redbubble.com
> Gn. Blogg: http://gnangarra.wordpress.com