This paper (first reference) is the result of a class project I was part of
almost two years ago for CSCI 5417 Information Retrieval Systems. It builds
on a class project I did in CSCI 5832 Natural Language Processing, which
I presented at Wikimania '07. The project was very late, as we didn't send
the final paper in until the day before New Year's. As far as I recall, this
technical report was never really announced, so I thought it would be
interesting to look briefly at the results. The goal of this paper was to
break articles down into surface features and latent features and then use
those to study the rating system being used, predict article quality, and
rank results in a search engine. We used the [[random forests]] classifier,
which allowed us to analyze the contribution of each feature to performance
by looking directly at the weights that were assigned. While the surface
analysis was performed on the whole English Wikipedia, the latent analysis
was performed on the
Simple English Wikipedia (it is more expensive to compute).
= Surface features =
* Readability measures are the single best predictor of quality, as defined
by the Wikipedia Editorial Team (WET), that I have found. The [[Automated
Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid Grade
Level]] were the strongest predictors, followed by length of article html,
number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number of
internal links, [[Laesbarhedsindex Readability Formula]], number of words
and number of references. Weakly predictive were number of to be's, number
of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number
of external links and number of relative links. Not predictive (overall; see
the end of section 2 for the per-rating score breakdown) were number of h2
or h3's, number of conjunctions, number of images*, average word length,
number of h4's, number of prepositions, number of pronouns, number of
interlanguage links, average syllables per word, number of nominalizations,
article age (based on page id), proportion of questions and average
sentence length.
:* Number of images was actually by far the single strongest predictor of
any class, but only for Featured articles. Because it was so good at picking
out Featured articles and only somewhat good at picking out A and G
articles, the classifier was confused in so many cases that the overall
contribution of this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high
predictive value assigned to many features. The Featured class is also very
distinctive. F, B and S (Start/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.
= Latent features =
The algorithm used for latent analysis, which is an
analysis of the occurrence of words in every document with respect to the
link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
Allocation]]. This part of the analysis was done by CS PhD student Praful
Mangalath. An example of what can be done with the result of this analysis
is that you provide a word (a search query) such as "hippie". You can then
look at the weight of every article for the word hippie. You can pick the
article with the largest weight, and then look at its link network. You can
pick out the articles that this article links to and/or which link to this
article that are also weighted strongly for the word hippie, while also
contributing maximally to this article's "hippieness". We tried this query in
our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple
English Wikipedia's Lucene search engine. The breakdown of articles occurring
in the top ten search results for this word for those engines is:
* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl
Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic
Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]],
[[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]],
[[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]],
[[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a
democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]
(See the paper for the articles produced for the keywords philosophy and
economics.)
= Discussion / Conclusion =
* The results of the latent analysis are totally up to your
perception. But what is interesting is that the LDA features predict the WET
ratings of quality just as well as the surface-level features. Both feature
sets (surface and latent) pull out almost all of the information that the
rating system bears.
* The rating system devised by the WET is not distinctive. You can best tell
the difference between Featured, A and Good articles, grouped together, and
B articles. Featured, A and Good articles are also quite distinctive
(Figure 1). Note that in this study we didn't look at Starts and Stubs, but
in an earlier paper we did.
:* This is interesting when compared to this recent entry on the YouTube
blog, "Five Stars Dominate Ratings":
http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html…
I think a sane, well-researched (with actual subjects) rating system is
well within the purview of the Usability Initiative. Helping people find and
create good content is what Wikipedia is all about. Having a solid rating
system allows you to reorganize the user interface, the Wikipedia
namespace, and the main namespace around good content and bad content as
needed. If you don't have a solid, information-bearing rating system, you
don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles
and using that to train machines to automatically pick out good content. You
ask people questions along dimensions that make sense to people, and give
the machine access to other surface features (such as a statistical measure
of readability, or length) and latent features (such as can be derived from
document word occurrence and encyclopedia link structure). I referenced page
262 of Zen and the Art of Motorcycle Maintenance to give an example of the
kind of qualitative features I would ask people about. It really depends on
what features end up bearing information, to be tested in "the lab". Each
word is an example dimension of quality: We have "*unity, vividness,
authority, economy, sensitivity, clarity, emphasis, flow, suspense,
brilliance, precision, proportion, depth and so on.*" You then use surface
and latent features to predict these values for all articles. You can also
say that when a person rates an article as high on the x scale, they also
mean that it has this much of these surface and these latent features.
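
As a rough illustration of the kind of surface feature discussed under
Surface features, here is a minimal Python sketch of the Automated
Readability Index, using the standard published formula (4.71 times
characters per word, plus 0.5 times words per sentence, minus 21.43). The
function name and the word and sentence splitting rules are simplifying
assumptions of this sketch; they are not necessarily what was used in the
paper.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return the Automated Readability Index (ARI) of `text`.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    """
    # Assumption: a "word" is a run of ASCII letters, digits or apostrophes,
    # and every character in such a run counts towards the character total.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    harder = ("Readability measures approximate the educational attainment "
              "required to comprehend complicated encyclopedic prose.")
    # Longer words and longer sentences both push the score up.
    print(round(automated_readability_index(simple), 2))
    print(round(automated_readability_index(harder), 2))
```

A score like this would be one column in the feature vector handed to the
classifier, alongside counts such as number of paragraphs or internal links.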
= References =
- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
Wikipedia through quality and concept discovery*. Technical Report.
PDF<http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalat…>
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
feasibility of automatically rating online article quality*. Technical
Report. PDF<http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincock…>
Hoi,
I have asked and received permission to forward to you all this most
excellent bit of news.
The LINGUIST List is a most excellent resource for people interested in the
field of linguistics. As I mentioned some time ago, they held a funding
drive in which they asked for a certain amount of money within a given
number of days; in return they would start a project on Wikipedia to
learn what needs doing to get better coverage for the field of linguistics.
What you will read in this mail is that the whole community of linguists is
asked to cooperate. I am really thrilled, as it will also get more
linguists interested in what we do. My hope is that a fraction will be
interested in the languages that they care for and help them become more
relevant. As a member of the "language prevention committee", I would love
to get more knowledgeable people involved in our smaller projects. If it
means that we get more requests for more projects, we will really feel
embarrassed by all the new projects we will have to approve because of the
quality of the Incubator content and the quality of the linguistic arguments
why we should approve yet another language :)
NB Is this not a really clever way of raising money? Give us this much in
this time frame and we will then do this as a bonus...
Thanks,
GerardM
---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
To: LINGUIST(a)listserv.linguistlist.org
LINGUIST List: Vol-18-1831. Mon Jun 18 2007. ISSN: 1068 - 4875.
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
<reviews(a)linguistlist.org>
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html
===========================Directory==============================
1)
Date: 18-Jun-2007
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers
Dear subscribers,
As you may recall, one of our Fund Drive 2007 campaigns was called the
"Wikipedia Update Vote." We asked our viewers to consider earmarking their
donations to organize an update project on linguistics entries in the
English-language Wikipedia. You can find more background information on this
at:
http://linguistlist.org/donation/fund-drive2007/wikipedia/index.cfm.
The speed with which we met our goal, thanks to the interest and generosity
of our readers, was a sure sign that the linguistics community was
enthusiastic about the idea. Now that summer is upon us, and some of you may
have a bit more leisure time, we are hoping that you will be able to help us
get started on the Wikipedia project. The LINGUIST List's role in this
project is a purely organizational one. We will:
*Help, with your input, to identify major gaps in the Wikipedia materials or
pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified
as "in need of attention from an expert on the subject" or "does not cite
any references or sources," etc;
*Send out periodical calls for volunteer contributors on specific topics or
articles;
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with Wikimedia Foundation to publicize the linguistics community's
efforts.
We hope you are as enthusiastic about this effort as we are. Just to help us
all get started looking at Wikipedia more critically, and to easily identify
an area needing improvement, we suggest that you take a look at the List of
Linguists page at:
http://en.wikipedia.org/wiki/List_of_linguists.
Many people are not listed there; others need to have more facts and
information added. If you would like to participate in this exciting update
effort, please respond by sending an email to LINGUIST Editor Hannah Morales
at hannah(a)linguistlist.org, suggesting what your role might be or which
linguistics entries you feel should be updated or added. Some linguists who
saw our campaign on the Internet have already written us with specific
suggestions, which we will share with you soon.
This update project will take major time and effort on all our parts. The
end result will be a much richer internet resource of information on the
breadth and depth of the field of linguistics. Our efforts should also
stimulate prospective students to consider studying linguistics and to
educate a wider public on what we do. Please consider participating.
Sincerely,
Hannah Morales
Editor, Wikipedia Update Project
Linguistic Field(s): Not Applicable
-----------------------------------------------------------
LINGUIST List: Vol-18-1831
According to the recent Independent Auditors' Report of the WMF [1], at
some point prior to the end of June 2020, an entity called the "Wikimedia
Knowledge Equity Fund" was established, and $8.723 million was transferred
to it by the WMF, in the form of an unconditional grant. The Fund is
"managed and controlled by Tides Advocacy" (a 501(c)(4) advocacy nonprofit
previously led by the WMF's current General Counsel/Board Secretary, who
served as CEO, Board Secretary, and Treasurer there). Given that a Google
search for "Wikimedia Knowledge Equity Fund" yields zero results prior to
the release of the report, it is clear that the WMF kept this significant
move completely secret for over five months, perhaps over a year. The
Report FAQ additionally emphasizes that the WMF "has no right of return to
the grant funds provided, with the exception of unexpended funds."
The WMF unilaterally and secretly transferred nearly $9 million of movement
funds to an outside organization not recognized by the Affiliations
Committee. No mention of the grant was made in any Board resolutions or
minutes from the relevant time period. The amount was not mentioned in the
public annual plan, which set out rather less than this amount for the
entire grantmaking budget for the year. No application was made through any
of the various Wikimedia grants processes. No further information has been
provided on the administration of this new Fund, or on the text of the
grant agreement.
I am appalled.
-- Yair Rand
[1]
https://upload.wikimedia.org/wikipedia/foundation/f/f7/Wikimedia_Foundation…
As a consequence of the promotion of a Google Forms-based survey this
week by a WMF representative, a proposal has been started on Wikimedia
Commons to ban the promotion of surveys which rely on third party
sites like Google Forms.[1]
The proposal was launched today, but it already appears likely to gain
consensus in support. Considering that Commons is one of our
largest Wikimedia projects, there are potential repercussions of
banning the on-wiki promotion of surveys which use Google products or
other closed source third party products like SurveyMonkey.
Feedback is most welcome on the proposal discussion, or on this list,
about handling the impact, solutions, recommended alternatives that already
exist, or the future role of the WMF in supporting research and surveys
for the WMF and affiliates by forking open source software and
self-hosting and self-managing data "locally".
Links
1. https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Use_of_of…
Thanks
Fae
--
faewik(a)gmail.com https://commons.wikimedia.org/wiki/User:Fae
#WearAMask
BEIJING, Feb. 20 (Qiuwen) - NetBlocks, the internet freedom advocacy group,
says Wikipedia was blocked in Myanmar by the authorities.
NetBlocks confirmed that "all language editions of Wikipedia" were down in
Myanmar starting Thursday morning local time. In a tweet [1], NetBlocks said
this is "part of a widening post-coup internet censorship regime imposed by
the military junta."
NetBlocks provided additional information in a picture attached to the tweet,
which suggests that they tested the connectivity of Wikipedia in English and
French, Wikidata, and wikimedia.org, with none of them accessible. This may
indicate that it is highly likely that the Burmese authorities blocked not
only "all language editions of Wikipedia," but all Wikimedia projects as a
whole. The picture also suggests that Wikipedia remains inaccessible across
four different internet service providers in Myanmar.
It is likely that the Burmese authorities are blocking Wikimedia projects
using the same tactic seen in China and some other countries, which is to
block the main IP address that Wikipedia and its sister projects use. All
Wikimedia projects share the same IP address, which makes it an easy target
for censors implementing a block.
Qiuwen noticed that, starting on February 19th, there was a noticeable
increase on Burmese Wikipedia in edits made from IP addresses likely to
belong to VPNs, signaling that locals may already have to use VPNs to get
onto Wikipedia. On Friday evening local time, an administrator posted a
message on the Village Pump of Burmese Wikipedia, explaining the use of "IP
block exemption," a special MediaWiki flag, similar to rollback and patrol,
that allows users with the flag to edit from VPNs. A similar banner was also
set up, visible on every page of Burmese Wikipedia. The "IP block exemption"
flag is widely issued to users of Chinese Wikipedia and was previously
issued to users of Turkish Wikipedia, who needed VPNs for access.
Internet blackouts are increasingly common in Myanmar and across the world.
The military shut down the internet before attempting the coup on February
1st, and the military authority has blocked or temporarily blocked Facebook
and other social media platforms since February 3rd. Usage of VPNs
reportedly skyrocketed among locals eager to access blocked websites.
NetBlocks says the authorities have been implementing an "internet curfew,"
with the internet shut down at night.
This also means Myanmar has joined a growing club of countries that have
blocked Wikipedia. Its recent members include Iran, which blocked Wikipedia
for around 24 hours in March 2020, and Venezuela in January 2019. In
countries such as Iran, internet blackouts have also interfered with the
Wikimedia movement; Iran's week-long blackout in November 2019, for example,
delayed the Wikipedia Asian Month edit-a-thon. China, the "permanent member"
of the club, has blocked Wikipedia since 2015. It is not clear whether the
block on Wikimedia projects will be lifted in the future, as the Iranian and
Turkish authorities eventually did.
The Wikimedia Foundation has yet to comment on the block. Myanmar Wikimedia
Community User Group, the Wikimedia user group representing Myanmar, has also
yet to comment. Their Facebook page was last updated on January 16th, two
weeks before the military coup.
----
Qiuwen is a news service operated by the Wikimedians of Mainland China user group[2].
Follow us for the latest Wikimedia news in greater China.
CC BY-SA 4.0
Follow us on Telegram: https://t.me/Qiuwen
[1]: https://twitter.com/netblocks/status/1362814793502097409
[2]: https://meta.wikimedia.org/wiki/Wikimedians_of_Mainland_China
Dear Community,
As we announced during December,[1] this year’s WikiForHumanRights campaign
will be arriving for Earth Day: April 15-May 15. The theme is
“Right to a Healthy Environment” -- connecting the 20th Birthday “Human”
theme with the global conversations about COVID-19, environmental crises
like climate change, and human rights. You can learn more about the theme at
our under-development home page. [2]
The campaign will have two parts: a challenge, like the WikiGap Challenge,
and decentralized digital and in-person events (where COVID-19 risk
assessments allow).
We need your help organizing the decentralized events -- and will be
supporting development of activities through mentoring and rapid grants.
Do you want to be part of organizing? The Campaigns Team at the Wikimedia
Foundation invites you to:
- the upcoming Wikimedians for Sustainable Development meeting on 21
February at 11:00am UTC [3]
- our next virtual office hour on 25 February 2021 at 3:00pm UTC via
Zoom [4]
If you are interested, but can’t attend the meeting: you can also join the
Wikimedians for Sustainable Development communication channels for general
updates [5] or the WikiForHumanRights telegram channel, where we will share
more updates for organizers [6].
Please forward this information to movement organizers who you think will
be interested.
Looking forward to talking with you soon!
Alex Stinson
[1]
https://diff.wikimedia.org/2020/12/10/wikiforhumanrights-2021-help-share-hu…
[2] https://meta.wikimedia.org/wiki/WikiForHumanRights
[3]
https://meta.wikimedia.org/wiki/Wikimedians_for_Sustainable_Development/Nex…
[4] See
https://meta.wikimedia.org/wiki/Campaigns/Foundation_Campaigns_Team#Office_…
or https://wikimedia.zoom.us/j/97336292017
[5]
https://meta.wikimedia.org/wiki/Wikimedians_for_Sustainable_Development#Com…
[6] https://t.me/joinchat/Ifp3xRo2fEIk_ecpZLdTEg
--
Alex Stinson
Senior Program Strategist
Wikimedia Foundation
Twitter: @sadads
Learn more about how the communities behind Wikipedia, Wikidata and other
Wikimedia projects partner with cultural heritage organizations:
https://outreach.wikimedia.org/wiki/GLAM
Hello everyone,
This is a reminder that the next open call for Project Grants, focused on
research and software proposals, started today, February 15, with a
submission deadline of March 16, 2021.
<https://meta.wikimedia.org/wiki/Grants:Project>
For this round, we invite you to propose grant applications that fall under
research and software categories. We offer the following resources to help
you plan your project and complete a grant proposal:
*Weekly proposals clinics via Hangouts during the Open Call. Join us for
real-time discussions with Program Officers and select thematic experts and
get live feedback about your Project Grants proposal. We’ll answer
questions and help you make your proposal better.<
https://meta.wikimedia.org/wiki/Grants:Project#Upcoming_events>
*Video tutorials for writing a strong application:<
https://meta.wikimedia.org/wiki/Grants:Project/Tutorial>
*General planning page for Project Grants:<
https://meta.wikimedia.org/wiki/Grants:Project/Plan>
*Program guidelines and criteria:<
https://meta.wikimedia.org/wiki/Grants:Project/Learn>
*Program officers are also available to offer individualized proposal
support upon request.
We are excited to see your grant ideas that will support our community and
make an impact on the future of Wikimedia projects. Put your idea into
motion, and submit your proposal by March 16, 2021! <
https://meta.wikimedia.org/wiki/Grants:Project/Apply>
Please feel free to get in touch with questions about getting started with
your grant application. We also have an ongoing open call for the Project
Grants Committee. If you are interested in serving on the Committee, please
apply through this link<
https://meta.wikimedia.org/wiki/Grants:Project/Committee/Candidates>. We
are in particular need of software experts; research experts are also
welcome to apply for a role on the committee. Contact us at
projectgrants(a)wikimedia.org if you would like feedback or more information.
Best regards,
Rupika
*Rupika Sharma*
Junior Program Officer
Wikimedia Foundation Grants
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi everyone,
I’m pleased to announce that the Board of Trustees has unanimously approved
a Universal Code of Conduct for the Wikimedia projects and movement.[1] A
Universal Code of Conduct was one of the final recommendations of the
Movement Strategy 2030 process - a multi-year, participatory community
effort to define the future of our movement. The final Universal Code of
Conduct seeks to address disparities in conduct policies across our
hundreds of projects and communities, by creating a binding minimum set of
standards for conduct on the Wikimedia projects that directly address many
of the challenges that contributors face.
The Board is deeply grateful to the communities who have grappled with
these challenging topics. Over the past six months, communities around the
world have participated in conversations and consultations to help build
this code collectively, including local discussions in 19 languages,
surveys, discussions on Meta, and policy drafting by a committee of
volunteers and staff. The document presented to us reflects a significant
investment of time and effort by many of you, and especially by the joint
staff/volunteer committee who created the base draft after reviewing input
collected from community outreach efforts. We also appreciate the
dedication of the Foundation, and its Trust & Safety policy team, in
getting us to this phase.
This was the first phase of our Universal Code of Conduct - from here, the
Trust & Safety team will begin consultations on how best to enforce this
code. In the coming weeks, they will follow up with more instructions on
how you can participate in discussions around enforcing the new code. Over
the next few months, they will be facilitating consultation discussions in
many local languages, with our affiliates, and on Meta to support a new
volunteer/staff committee in drafting enforcement pathways. For more
information on the process, timeline, and how to participate in this next
phase, please review the Universal Code of Conduct page on Meta.[2]
The Universal Code of Conduct represents an essential step towards our
vision of a world in which all people can participate in the sum of all
knowledge. Together, we have built something extraordinary. Today, we
celebrate this milestone in making our movement a safer space for
contribution for all.
On behalf of the Board of Trustees,
María Sefidari
Board Chair
[1] https://meta.wikimedia.org/wiki/Universal_Code_of_Conduct/Draft_review
[2] https://meta.wikimedia.org/wiki/Universal_Code_of_Conduct
Hi community!
I am happy and proud to announce that the Wikimedia User Group Esperanto
and Free Knowledge (ELiSo) now has a full stack of functionaries. For years
we worked as an informal group, but after official incorporation at
the end of 2019, we moved into a formal organisational mode.
Board:
- Chair: Michal Matúšov (KuboF Hromoslav)
- Vice-Chair: Ivan Camilo Quintero Santacruz
- Board Member: Juan Sebastian Quintero Santacruz
Audit Committee:
- Chair of Audit Committee: Ziko van Dijk
- Michel Castelo Branco
- Yves Nevelsteen
As you can see, among the functionaries are several co-founders, former
Chairs and Vice-Chairs of other Wikimedia affiliates, and even authors
of books about Wikipedia. Such expertise is very welcome in our effort to
be recognized as a Wikimedia Thematic Organisation and as a collective
member of the Universal Esperanto Association and World Esperanto Youth
Organization.
Best regards
Michal Matúšov / User:KuboF Hromoslav
Esperanto and Free Knowledge (WUG ELiSo) <https://esperanto.wiki/>
Chair
Hello,
For the 20th anniversary of Wikipedia, the Franco-German television
channel Arte <https://en.wikipedia.org/wiki/Arte> is broadcasting the
documentary Wikipedia and the Democratisation of Knowledge
<https://www.arte.tv/en/videos/093704-000-A/wikipedia-and-the-democratisatio…>.
You can freely watch it until 04/04/2021.
Sorry if this information has already been shared here, but a quick search
didn't turn up anything on the topic.
Cheers,
psychoslave