This paper (first reference) is the result of a class project I was part of
almost two years ago for CSCI 5417 Information Retrieval Systems. It builds
on a class project I did in CSCI 5832 Natural Language Processing and which
I presented at Wikimania '07. The project was very late, as we didn't send
the final paper in until the day before New Year's. As far as I recall, this
technical report was never really announced, so I thought it would be interesting to
look briefly at the results. The goal of this paper was to break articles
down into surface features and latent features and then use those to study
the rating system being used, predict article quality and rank results in a
search engine. We used the [[random forests]] classifier, which allowed us to
analyze the contribution of each feature to performance by looking directly
at the weights that were assigned. While the surface analysis was performed
on the whole English Wikipedia, the latent analysis was performed on the
Simple English Wikipedia (it is more expensive to compute).
= Surface features =
* Readability measures are the single best predictor of quality, as defined
by the Wikipedia Editorial Team (WET), that I have found. The
[[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid
Grade Level]] were the strongest predictors, followed by length of article
HTML, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number
of internal links, [[Laesbarhedsindex Readability Formula]], number of words
and number of references. Weakly predictive were number of to be's, number
of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number
of external links, number of relative links. Not predictive (overall - see
the end of section 2 for the per-rating score breakdown): Number of h2 or
h3's, number of conjunctions, number of images*, average word length, number
of h4's, number of prepositions, number of pronouns, number of interlanguage
links, average syllables per word, number of nominalizations, article age
(based on page id), proportion of questions, average sentence length.
:* Number of images was actually by far the single strongest predictor of any
class, but only for Featured articles. Because it was so good at picking out
Featured articles and somewhat good at picking out A and G articles, the
classifier was confused in so many cases that the overall contribution of
this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high
predictive value assigned to many features. The Featured class is also very
distinctive. F, B and S (Stop/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.
= Latent features =
The algorithm used for latent analysis, which is an
analysis of the occurrence of words in every document with respect to the
link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
Allocation]]. This part of the analysis was done by CS PhD student Praful
Mangalath. An example of what can be done with the result of this analysis
is that you provide a word (a search query) such as "hippie". You can then
look at the weight of every article for the word hippie. You can pick the
article with the largest weight, and then look at its link network. You can
pick out the articles that this article links to and/or which link to this
article that are also weighted strongly for the word hippie, while also
contributing maximally to this article's "hippieness". We tried this query in
our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple
English Wikipedia's Lucene search engine. The breakdown of articles occurring
in the top ten search results for this word for those engines is:
* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]], [[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]], [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]
(See the paper for the articles produced for the keywords philosophy and economics.)
= Discussion / Conclusion =
* The results of the latent analysis are totally up to your
perception. But what is interesting is that the LDA features predict the WET
ratings of quality just as well as the surface-level features. Both feature
sets (surface and latent) pull out almost all of the information that
the rating system bears.
* The rating system devised by the WET is not
distinctive. You can best tell the difference between, grouped together,
Featured, A and Good articles vs B articles. Featured, A and Good articles
are also quite distinctive (Figure 1). Note that in this study we didn't
look at Starts and Stubs, but in an earlier paper we did.
:* This is interesting when compared to this recent entry on the YouTube blog,
"Five Stars Dominate Ratings".
I think a sane, well-researched (with actual subjects) rating system is
well within the purview of the Usability Initiative. Helping people find and
create good content is what Wikipedia is all about. Having a solid rating
system allows you to reorganize the user interface, the Wikipedia
namespace, and the main namespace around good content and bad content as
needed. If you don't have a solid, information-bearing rating system, you
don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles
and using that to train machines to automatically pick out good content. You
ask people questions along dimensions that make sense to people, and give
the machine access to other surface features (such as a statistical measure
of readability, or length) and latent features (such as can be derived from
document word occurrence and encyclopedia link structure). I referenced page
262 of Zen and the Art of Motorcycle Maintenance to give an example of the
kind of qualitative features I would ask people about. It really depends on what
features end up bearing information, to be tested in "the lab". Each word is
an example dimension of quality: We have "*unity, vividness, authority,
economy, sensitivity, clarity, emphasis, flow, suspense, brilliance,
precision, proportion, depth and so on.*" You then use surface and latent
features to predict these values for all articles. You can also say, when a
person rates this article as high on the x scale, they also mean that it
has this much of these surface and these latent features.
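To make "a statistical measure of readability" concrete, here is a minimal Python sketch of one such surface feature, the Automated Readability Index, using its standard published formula: 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The function name and the simple regex-based tokenization are assumptions made for illustration; this is not the feature extraction code used in the paper, and a real pipeline would first need to strip wiki markup or HTML.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return the Automated Readability Index (ARI) of plain text.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

    Assumption: words are runs of letters or digits, and sentences are
    separated by '.', '!' or '?'. This is a rough approximation only.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(word) for word in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = "Encyclopedic articles frequently contain complicated terminology."
    # Longer words and longer sentences push the score up.
    print(automated_readability_index(simple))
    print(automated_readability_index(dense))
```

A score like this would be one column in the feature vector for an article, next to length, link counts and the other surface features listed above.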
= References =
- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
Wikipedia through quality and concept discovery*. Technical Report.
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
feasibility of automatically rating online article quality*. Technical
I have asked and received permission to forward to you all this most
excellent bit of news.
The LINGUIST List is a most excellent resource for people interested in the
field of linguistics. As I mentioned some time ago, they have had a funding
drive in which they asked for a certain amount of money within
a given number of days; in return they would run a project on Wikipedia to
learn what needs doing to get better coverage of the field of linguistics.
What you will read in this mail is that the whole community of linguists is
asked to cooperate. I am really thrilled, as it will also get more
linguists interested in what we do. My hope is that a fraction of them will be
interested in the languages that they care for and help them become more
relevant. As a member of the "language prevention committee", I would love to get
more knowledgeable people involved in our smaller projects. If it means that
we get more requests for more projects we will really feel embarrassed with
all the new projects we will have to approve because of the quality of the
Incubator content and the quality of the linguistic arguments why we should
approve yet another language :)
NB Is this not a really clever way of raising money; give us this much in
this time frame and we will then do this as a bonus...
---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
LINGUIST List: Vol-18-1831. Mon Jun 18 2007. ISSN: 1068 - 4875.
Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
To post to LINGUIST, use our convenient web form at
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
As you may recall, one of our Fund Drive 2007 campaigns was called the
"Wikipedia Update Vote." We asked our viewers to consider earmarking their
donations to organize an update project on linguistics entries in the
English-language Wikipedia. You can find more background information on this
The speed with which we met our goal, thanks to the interest and generosity of
our readers, was a sure sign that the linguistics community was enthusiastic
about the idea. Now that summer is upon us, and some of you may have a bit of
leisure time, we are hoping that you will be able to help us get started on the
Wikipedia project. The LINGUIST List's role in this project is a purely
organizational one. We will:
*Help, with your input, to identify major gaps in the Wikipedia materials or
pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified as
"in need of attention from an expert on the subject" or "does not cite any
references or sources," etc;
*Send out periodical calls for volunteer contributors on specific topics or
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with Wikimedia Foundation to publicize the linguistics community's
We hope you are as enthusiastic about this effort as we are. Just to help us
get started looking at Wikipedia more critically, and to easily identify an
needing improvement, we suggest that you take a look at the List of
Many people are not listed there; others need to have more facts and
added. If you would like to participate in this exciting update effort,
respond by sending an email to LINGUIST Editor Hannah Morales at
hannah(a)linguistlist.org, suggesting what your role might be or which
entries you feel should be updated or added. Some linguists who saw our
on the Internet have already written us with specific suggestions, which we will
share with you soon.
This update project will take major time and effort on all our parts. The
result will be a much richer internet resource of information on the breadth and
depth of the field of linguistics. Our efforts should also stimulate
students to consider studying linguistics and to educate a wider public on what
we do. Please consider participating.
Editor, Wikipedia Update Project
Linguistic Field(s): Not Applicable
LINGUIST List: Vol-18-1831
There is a request for a Wikipedia in Ancient Greek. This request has so far
been denied. A lot of words have been used about it. Many people maintain
their positions and do not for whatever reason consider the arguments of
In my opinion there are a few roadblocks.
- Ancient Greek is an ancient language - the policy does not allow for
- Text in Ancient Greek written today about contemporary subjects
requires the reconstruction of Ancient Greek.
- it requires the use of existing words for concepts that did
not exist at the time when the language was alive
- neologisms will be needed to describe things that did not
exist at the time when the language was alive
- modern texts will not represent the language as it used to be
- Constructed and by inference reconstructed languages are effectively
We can change the policy if there are sufficient arguments, when we agree on
When a text is written in reconstructed ancient Greek, and when it is
clearly stated that it is NOT the ancient Greek of bygone days, it can be
obvious that it is a great tool to learn skills to read and write ancient
Greek but that it is in itself not Ancient Greek. Ancient Greek as a
language is ancient. I have had a word with people who are involved in the
working group that deals with the ISO-639, I have had a word with someone
from SIL and it is clear that a proposal for a code for "Ancient Greek
reconstructed" will be considered for the ISO-639-3. For the ISO-639-6 a
code is likely to be given because a clear use for this code can be given.
We can apply for a code, and as it has a use bigger than Wikipedia alone it
clearly has merit.
With modern texts clearly labelled as distinct from the original language,
it will be obvious that innovations a writer needs for his writing are
This leaves the fact that constructed and reconstructed languages are not
permitted because of the notion that mother tongue users are required. In my
opinion, this has always been only a gesture to those people who are dead
set against any and all constructed languages. In the policies there is
something vague "*it must have a reasonable degree of recognition as
determined by discussion (this requirement is being discussed by the language
subcommittee <http://meta.wikimedia.org/wiki/Language_subcommittee>)."* It
is vague because even though the policy talks about a discussion, it is
killed off immediately by stating "The proposal has a sufficient number of
living native speakers to form a viable community and audience." In my
opinion, this discussion for criteria for the acceptance of constructed or
reconstructed languages has not happened. Proposals for objective criteria
have been ignored.
In essence, to be clear about it:
- We can get a code for reconstructed languages.
- We need to change the policy to allow for reconstructed and
We need to do both in order to move forward.
The proposal for objective criteria for constructed and reconstructed
languages is in a nutshell:
- The language must have an ISO-639-3 code
- We need full WMF localisation from the start
- The language must be sufficiently expressive for writing a modern
- The Incubator project must have sufficiently large articles that
demonstrate both the language and its ability to write about a wide range of
- A sufficiently large group of editors must be part of the Incubator
To avoid further disrupting discussion of interlanguage links and
usability, I'll address the cultural problems separately now. I must
admit, though, that in a discussion where we seemed to have agreed
(rightfully so) that a 1% click rate was significant enough to warrant
serious consideration, I was disappointed that someone could then be so
callous about the need for cultural sensitivity because it most directly
impacts "only 0.55% of the world population" in this case. There is no
meaningful difference in order of magnitude there.
We have significant distortions in the makeup of our community that
affect our culture. There are quite a few groups that are seriously
underrepresented, in part because our culture comes across as unfriendly
to them at best. I talked about African-Americans because it's what was
applicable in that particular situation and I happen to have some
familiarity with the issues. It could just as well have been Australian
Aborigines or another cultural group that has issues with our community.
I'm not as prepared to explain those concerns, but I would welcome
people who can educate us about such problems. It's legitimate to be
wary of things that promote American cultural hegemony, which is another
distortion, but that's not really warranted when the concern relates to
a minority culture in the US.
Some people seem to have gotten hung up on the issue of intent. I didn't
say there was any intent, by the community or individuals, to exclude
certain groups or to create a hostile environment for them. I actually
tried to be as careful as possible not to say that. The point is that
even in the absence of intent, it's possible for our culture to appear
hostile to such groups. We didn't have any intent to be hostile toward
living people, either, yet we've had a long struggle to cope with the
consequences of that impression created by our culture.
Consider the principle of not "biting" newcomers, which relates to a
similar problem. It's not about the intent of the person doing the
"biting", it's about the impact on those who encounter it. We need to be
more welcoming to people, and striving for more cultural awareness is
part of that.
The videos Wikimedia recently produced are available on YouTube,
Facebook and several other sites. Can somebody from the Foundation who
has access to the videos update them and include the subtitles in
several different languages that were provided by Wikimedians on Commons
(see the file description pages of the four videos in
The default video player on Commons does not support subtitles. They are
only available through the mwEmbed extension. But videos on YouTube,
Facebook etc. support subtitles natively.
That would make it much easier for non-English Wikimedians to direct the
interested public to the subtitled video in the respective language.
In a message dated 9/20/2010 12:02:43 PM Pacific Daylight Time,
> In my experience
> the problem of humanities in Wikipedia is that the methods and training of
> the 'experts' is so fundamentally different from that of 'Wikipedians'
> by and large have no training at all) that disputes nearly always turn
> ugly. >>
You are again stating the problem as expert vs pedestrian (untrained at
However I again submit that in Wikipedia, you are not an "expert" because
you have a credential, you are an expert because you behave like an expert.
When challenged to provide a source, you cite your source, and other readers
find that it does actually state what you claim it states.
However it seems to me that you'd perhaps like experts to be able to make
unchallengeable claims without sources.
If I'm wrong in that last sentence, then tell me why being an expert is any
different than being any editor at all.
What is the actual procedure by which, when an expert edits, we see
something different than when anyone edits.
I can read a book on the History of the Fourth Crusade, and add quotes to
our articles on the persons and events, just as well as an expert in that
The problem comes, imho, when "experts" add claims that are unsourced, and
when challenged on them, get uppity about it.
The issue is not uncited claims, or challenged claims. All of our articles
have uncited claims and many have challenged and yet-unfulfilled claims.
The issue is how you are proposing these should be treated differently if the
claim comes from an "expert" versus a "non-expert", isn't it?
So address that.
In a message dated 9/19/2010 9:38:37 AM Pacific Daylight Time,
> "I would strongly urge you to leave the editing of articles
> concerning philosophy and/or philosophers to genuine experts. You simply
> lack the understanding and expertise required to assess whether an edit is
> genuine improvement or an obvious and cowardly sniper attack (as with the
> insertion in question)."
Yes I now see the problem :)
Ivory tower eggheads who think they have the right now, to talk down to
other contributors instead of educating them.
If you, as an academic, cannot explain your edit/article/sentence to a
person who isn't already an expert in your field, then you simply are too
rarified to find a home here at Wikipedia and good riddance, in my opinion.
We don't need *more* huffing and puffing, put-out little boys fingering our
Articles which can only be understood and thus edited by those with IQs
over 165 should probably be consigned to specialist (read: read by few)
I'll take your one-sentence snipe as abject agreement :)
Prishtina Hosts Second International Conference on Software Freedom
For the second year running the Kosovar Association for Free/Libre and
Open Source Software
(FLOSSK) and the University of Prishtina are organizing a conference
dedicated to free software - Software Freedom Kosova Conference
This conference follows upon the success of SFK09 held in August last
year attended by more than 500 participants and over 40 national and
international speakers and professionals.
SFK10 will take place on 25 and 26 September starting at 9:00 in the
venues of the
Faculty of Electrical and Computer Engineering of the University of
Prishtina. This year the conference will host several notable hacker
Leon Shiman will speak on the use of FLOSS in public administration;
Rob Savoy of Gnash project will talk about network protocols; Mikel
Maron will speak on the geopolitical use of open maps and Peter Salus,
historian of operating systems, will lecture on the history of
development of GNU/Linux.
Overall, over 20 topics will be discussed, ranging from issues
associated with the Free Encyclopedia Wikipedia, GNU/Linux,
intellectual property licenses, building of communities,
OpenStreetMap, Sugar, and many other topics in the field of free
Topics to be discussed and the quality of lecturers, along with the
success of last year's conference make SFK10 the largest conference of
its kind in Southeast Europe.
The conference is held under the auspices of the Office of the Prime
Minister of the Republic of Kosovo and is supported by a number of
donors, among them the Ministry of Energy and
Mining, Mozilla, Rrota, PC World Albanian and the University of
Prishtina Student Center.
The conference is free to participants during the two days. The
presentations and detailed information on the conference can be found
For Immediate Release
On Sep 22, 3:04 pm, "jamesmikedup...(a)googlemail.com"
> here is a rough translation of the press release :
> Pristina is the conference host software
> ***On 25 and 26 September Pristina will be hosted for the second time
> Freedom Conference Software Kosovo.*** *FLOSS SFK10 Kosovo, organized by
> Faculty of Electrical and Computer Engineering (FIEK) of the University of
> SFK09 held last year was attended by about 500 people who attended about
> lectures of 25 lecturers. This time the conference will be focused: the 24
> lectures will be from Kosovo, region and world.
> The main and guest lecturers at the same time honor of this conference
> renowned as hackers Leon Shiman, Rob Savoye, Mikel Maro and Peter
> Salus.Shiman's Foundation board member who oversees the development of
> system for Linux and BSD - x.org, and the owner of Shiman Associates
> consulting firm. Savoye is the primary developer of Gnash as previously
> developed for Debian, Red Hat and Yahoo. Savoye has been programming since
> 1977. Maron specializes in programming applications based on geography and
> location. Maron is OpenStreeMap Foundation board member, a service
> to Google Maps. Salus is a linguist, computer scientist and historian of
> technology. He worked a professor and dean at several universities. But
> is only the result of the work of the organizing committee which is
> preparing the conference program for almost a year .
> Other topics will provide for all the little: Milot Shala will directly
> demonstrate the Qt Framework development of Nokia's, Martin will tell
> Bekkelund Norwegian practices with open source code (open source) in state
> administration, Baki Goxhaj will talk about WordPress, Marco Fioretti will
> show how programming languages can be used in schools. Other topics are
> Wikipedia, CAD, use of EU's funds in Open Source, Sugar platform for
> children, CMS systems for universities, Android platform, etc..
> The conference will be held on the premises of FIEK-regulation. Free
> Registration begins on Saturday at 9:00 pm and during the two days program
> starts at 10:00. The conference is supported financially by the Office of
> the Prime Minister, Wheel, PC World and New OpenWorld.al.
> For more visit the official website of Kosovo Organization free software
> open - FLOSS Kosovo <http://kosovasoftwarefreedom.org/> . */ telegraph /*
> On Wed, Sep 22, 2010 at 12:57 PM, Luca Paolo Pescatore <
> > wrote:
> > Ehm.... great... should I send to TechCrunch and other EN/US websites in
> > Albanian ? :)
> > Is it possible to have a PR in English ?
> > Luca
> > On Wed, Sep 22, 2010 at 12:54 PM, jamesmikedup...(a)googlemail.com <
> > jamesmikedup...(a)googlemail.com> wrote:
> >> Bernard Writes :
> >> The attached notice has been published today in Gazeta Express and I
> >> sent it to: Telegrafi, Koha Ditore and RTKlive.com. You can use this to
> >> to other media and maybe invite them to come. Also Arianit has written
> >> similar text that we can also use...(
> >> )
> >> As I said yesterday, it would be good if somebody knows people in the
> >> media and talks to them to come.
> >> --
> >> Group homepage:http://groups.google.com/group/foss-al?hl=en
> >> Send messages to: foss-al(a)googlegroups.com
> >> Unsubscribe: foss-al+unsubscribe(a)googlegroups.com<foss-al%2Bunsubscribe(a)googlegroups.com>
> James Michael DuPont
> Member of Free Libre Open Source Software Kosova and Albania flossk.org
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania flossk.org flossal.org
On Wed, Sep 29, 2010 at 2:55 AM, Erik Moeller <erik(a)wikimedia.org> wrote:
> the agenda for Board meetings is set by Sue
> together with the chair of the Board and other Board members.
It is? Isn't that really really odd?