Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research
The Subject Recruitment Approvals Group mentioned in the proposal
is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group
Before we move forward with seeking approval from the Wikipedia
community, we would like further input on the proposal and would
welcome help improving it.
Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research
--
Bryan Song
GroupLens Research
University of Minnesota
Some motivation for a proper WikiCite project. --sj
=============== Begin forwarded message ==================
"How citation distortions create unfounded authority: analysis of a
citation network"
http://www.bmj.com/cgi/content/full/339/jul20_3/b2680
Abstract:
Objective - To understand belief in a specific scientific claim by
studying the pattern of citations among papers stating it.
Design - A complete citation network was constructed from all PubMed
indexed English literature papers addressing the belief that β
amyloid, a protein accumulated in the brain in Alzheimer's
disease, is produced by and injures skeletal muscle of patients with
inclusion body myositis. Social network theory and graph theory were
used to analyse this network.
Main outcome measures - Citation bias, amplification, and invention,
and their effects on determining authority.
Results:
The network contained 242 papers and 675 citations addressing the
belief, with 220 553 citation paths supporting it. Unfounded authority
was established by citation bias against papers that refuted or
weakened the belief; amplification, the marked expansion of the belief
system by papers presenting no data addressing it; and forms of
invention such as the conversion of hypothesis into fact through
citation alone. Extension of this network into text within grants
funded by the National Institutes of Health
and obtained through the Freedom of Information Act showed the same
phenomena present and sometimes used to justify requests for funding.
Conclusion:
Citation is both an impartial scholarly method and a powerful form of
social communication. Through distortions in its social use that
include bias, amplification, and invention, citation can be used to
generate
information cascades resulting in unfounded authority of claims.
Construction and analysis of a claim specific citation network may
clarify the nature of a published belief system and expose distorted
methods of social citation.
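(The path-counting idea in the abstract is easy to illustrate. Below is a
toy sketch in Python; the paper names and citation edges are invented,
networkx is an assumed dependency, and the memoised recursion is only safe
on an acyclic citation graph:)

    # Toy illustration of counting the citation paths that support a
    # claim, in the spirit of the analysis above. Papers and edges are
    # invented; assumes the citation graph is a DAG.
    from functools import lru_cache

    import networkx as nx

    G = nx.DiGraph()  # edge (a, b) means paper a cites paper b
    G.add_edges_from([("review1", "primary1"), ("review2", "review1"),
                      ("paper3", "review2"), ("paper3", "primary1")])
    primary = {"primary1"}  # papers containing primary data for the claim

    @lru_cache(maxsize=None)
    def supporting_paths(paper):
        # Directed citation paths starting at `paper` and ending at a
        # paper that reports primary data.
        return sum((cited in primary) + supporting_paths(cited)
                   for cited in G.successors(paper))

    print(sum(supporting_paths(p) for p in G))  # 4 for this toy graph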
--
Samuel Klein identi.ca:sj w:user:sj
Hello,
I wanted to introduce myself. I am Amy Roth, the new Research Analyst
for the Public Policy Initiative
(http://outreach.wikimedia.org/wiki/Public_Policy_Initiative) at
Wikimedia. I am responsible for evaluating improvement in the quality of
U.S. Public Policy articles over the course of the project.
I know there is a lot of expertise out there, and hopefully, some of you
on this list will have an interest in the project and can provide some
feedback or support for the research. I just introduced an article
quality metric I hope to use for this project and would appreciate any
thoughts and ideas (preferably constructive) that experienced Wikipedia
data users may have.
http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_United_States_Publi…
cheers, Amy
--
Amy Roth
Research Analyst (Public Policy Initiative)
Wikimedia Foundation
415-839-6885 x671
aroth(a)wikimedia.org
Language Resources and Evaluation Journal
Special Issue on "Collaboratively Constructed Language Resources"
http://www.ukp.tu-darmstadt.de/scientific-community/special-issue-language-…
KEYWORDS
Wikipedia, Wiktionary, Mechanical Turk, Games with a Purpose,
Folksonomies, Twitter, Social Networks
INTRODUCTION
In recent years, online resources collaboratively constructed by
ordinary users on the Web have considerably influenced the
language resources community. They have been successfully used, for
example, as a substitute for conventional language resources and as
semantically structured corpora. In particular, the knowledge acquisition
bottlenecks and coverage problems pertinent to conventional language
resources can be overcome by collaboratively constructed resources.
The resource that has gained the greatest popularity in this respect
so far is Wikipedia. However, other promising resources have recently
emerged, such as folksonomies, Twitter, the wiki-based dictionary
Wiktionary, social Q&A sites like WikiAnswers, approaches based on
Mechanical Turk, and game-based approaches.
The benefits of using collaboratively constructed resources come along
with new challenges, such as the interoperability with existing
resources, or the quality of the extracted lexical semantic knowledge.
Interoperability between resources is crucial as no single resource
provides perfect coverage. The quality of collaboratively constructed
resources is a fundamental issue, as they often lack editorial control
or contain incomplete entries. These challenges also present an
opportunity for natural language processing methods to improve the
quality of collaboratively constructed resources. Researchers have
therefore proposed techniques, such as link prediction and information
extraction, that can be used to guide the "crowds" in constructing
resources of better quality.
TOPICS OF INTEREST
Specific topics include but are not limited to:
- Analysis of collaboratively constructed resources, such as
wiki-based resources, folksonomies, Twitter, or social networks;
- Using special features of collaboratively constructed resources
to create novel resource types, for example revision-based corpora,
simplified versions of resources, etc.;
- Analyzing the structure of collaboratively constructed resources
related to their use in computational linguistics and language
technology;
- Interoperability of collaboratively constructed resources with
conventional language resources and between themselves;
- Mining social and collaborative content for constructing structured
language resources (e.g. lexical semantic resources) and the
corresponding tools;
- Mining multilingual information from collaboratively constructed
resources;
- Game-based approaches to resource creation;
- Mechanical Turk for building language resources;
- Quality and reliability of collaboratively constructed language
resources.
We also welcome papers outlining the challenges related
to using collaboratively constructed resources in computational
linguistics and language technology, as well as papers spanning the
cross-disciplinary boundaries to discourse analysis, social network
analysis, and artificial intelligence.
IMPORTANT DATES
Submission deadline: Jul 1, 2010
Preliminary decisions: Oct 1, 2010
Submission of revised articles: Nov 1, 2010
Final versions due: Feb 1, 2011
SUBMISSION GUIDELINES
Submissions should not exceed 30 pages, must be in English, and
follow the submission guidelines on the Language Resources and
Evaluation Web site
http://www.springer.com/education/linguistics/computational+linguistics/jou…
Submissions will be reviewed according to the standards of the LRE
journal. Papers should not have been submitted or published elsewhere
but may be substantially extended or refined versions of conference
papers.
Substantially extended and revised versions of papers accepted at
previous workshops concerned with collaboratively constructed
semantic resources, e.g. the ACL 2009 workshop on "Collaboratively
Constructed Semantic Resources" or the forthcoming COLING 2010
workshop on the same topic, are encouraged.
http://www.ukp.tu-darmstadt.de/scientific-community/acl-ijcnlp-2009-worksho…
http://www.ukp.tu-darmstadt.de/scientific-community/coling-2010-workshop/
Authors are encouraged to send a brief email to Torsten Zesch
(lastname (at) tk.informatik.tu-darmstadt.de)
indicating their intention to submit an article as soon as possible,
including their contact information and the topic they intend to
address in their submissions.
To submit papers:
- Go to http://www.editorialmanager.com/chum/
- Register and login as an author
- Select "SI: Collaboratively Constructed Semantic" as Paper Type
- Follow the instructions on the screen
GUEST EDITORS
Iryna Gurevych and Torsten Zesch
UKP Lab
Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de
PRELIMINARY GUEST EDITORIAL BOARD
Further responses are pending and will be announced shortly.
Anette Frank Heidelberg University
Christiane Fellbaum Princeton University
Delphine Bernhard LIMSI-CNRS, Orsay, France
Diana McCarthy Lexical Computing Ltd
Graeme Hirst University of Toronto
Gregory Grefenstette Exalead, Paris, France
György Szarvas Technische Universität Darmstadt
Laurent Romary INRIA, France
Lothar Lemnitzer BBAW, Berlin, Germany
Massimo Poesio University of Essex
Piek Vossen VU University Amsterdam, Netherlands
Rada Mihalcea University of North Texas
Saif Mohammad National Research Council Canada
ABOUT THE JOURNAL
Language Resources and Evaluation is the first publication devoted to
the acquisition, creation, annotation, and use of language resources,
together with methods for evaluation of resources, technologies, and
applications.
On Thu, Jun 3, 2010 at 4:14 PM, Reid Priedhorsky <reid(a)umn.edu> wrote:
> Brian J Mingus wrote:
> > ---------- Forwarded message ----------
> > From: Brian <Brian.Mingus(a)colorado.edu>
> > Date: Wed, Jun 2, 2010 at 10:46 PM
> > Subject: Re: [Wiki-research-l] Quality and pageviews
> > To: Liam Wyatt <liamwyatt(a)gmail.com>
> >
> >
> > Interestingly, the result is negative. The correlation coefficient
> > between 2500 featured articles and 2500 random articles is 0.18,
> > which is very low. I also trained a linear classifier to predict
> > the quality of an article based on the number of page views and it
> > was no better than chance.
>
> That reminds me of an incidental finding from our 2007 work: we wanted
> to use article edit rate to predict view rate, but there was no
> correlation between the two.
>
> Reid
That is an interesting negative finding as well. Just so this thread doesn't
go without some positive results, here is a table from one of my technical
reports on some features that *do* correlate with quality. If a number is
greater than zero the feature correlates with quality; if it is 0 it does not
correlate; and if it is less than 0 it is negatively correlated with
quality. The scale of the numbers is meaningless and not interpretable,
although their relative magnitudes are important: they reflect the relative
performance of each feature for each class, as extracted from the weights of
a random forests classifier.
http://grey.colorado.edu/mediawiki/sites/mingus/images/1/1e/DeHoustMangalat…
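A minimal sketch of one way to produce such a table (the thread only
mentions Python and SciPy, so scikit-learn, the feature matrix, and the
labels below are illustrative assumptions, not the report's exact pipeline):

    # Sketch: signed per-class feature weights via one-vs-rest random
    # forests. Assumes scikit-learn; X, y, and feature_names are made up.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    feature_names = ["num_images", "num_external_links", "num_references"]
    X = np.random.rand(500, len(feature_names))           # one row per article
    y = np.random.choice(["FA", "GA", "B", "Stub"], 500)  # quality classes

    for cls in np.unique(y):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X, y == cls)  # one-vs-rest: this class against the others
        # Importances are non-negative, so sign them by each feature's
        # correlation with class membership to mimic the table above.
        member = (y == cls).astype(float)
        signs = np.sign([np.corrcoef(X[:, j], member)[0, 1]
                         for j in range(X.shape[1])])
        weights = signs * rf.feature_importances_
        print(cls, sorted(zip(feature_names, weights),
                          key=lambda t: -abs(t[1])))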
Summary (features in order of predictive ability):
- *Featured* articles are *correlated* with Number of images, Number of
external links, Automated Readability Index, Number of references, Number of
internal links, Length of article HTML, Gunning Fog Index, Flesch-Kincaid
Grade Level, Läsbarhetsindex Readability Formula, Number of words, Number of
to be's, Number of sentences
- Note that featured articles are easy to predict.
- *A* articles are *correlated* with Number of references, PageRank,
Number of external links, Number of images, Article age (page-id).
- Note that A articles are extremely hard to predict. All of the above
A predictors are weaker than all of the featured predictors. This class
should be merged with another quality class.
- *GA* articles are *correlated* with Number of external links, Number of
templates, Number of references, Automated Readability Index, Flesch-Kincaid
Grade Level
- *GA* articles are *negatively correlated* with Length of article HTML,
Flesch Reading Ease, Smog Grading
- Note that GA articles are extremely hard to predict and should be
merged with another quality class.
- *B* articles are *correlated* with Automated Readability Index,
Flesch-Kincaid Grade Level, Läsbarhetsindex Readability Formula, Gunning
Fog Index, Length of Article HTML, Number of paragraphs, Flesch Reading Ease,
Smog Grading, Number of internal links, Number of words, Number of
references, Number of to be's, Number of sentences, Coleman-Liau Index,
Number of templates, PageRank, Number of external links, Number of relative
links, Number of <h3>s, Number of interlanguage links
- Note that B articles are very easy to predict.
- *Start/Stub* were left out of this analysis because they are so easy to
predict based on a lack of pretty much any useful information.
On Sat, Jun 5, 2010 at 12:16 AM, Brian J Mingus
<Brian.Mingus(a)colorado.edu> wrote:
>...
>
> That is an interesting negative finding as well. Just so this thread doesn't
> go without some positive results, here is a table from one of my technical
> reports on some features that do correlate with quality. If the number is
> greater than zero it correlates with quality, if it is 0 it does not
> correlate, and if it is less than 0 it is negatively correlated with
> quality. The scale of the numbers is meaningless and not interpretable,
> although the relative magnitude is important. These are just the relative
> performance of each feature for each class, as extracted from the weights of
> a random forests classifier.
>
> http://grey.colorado.edu/mediawiki/sites/mingus/images/1/1e/DeHoustMangalat…
Any chance you can run a similar analysis to look for correlations
with page-views?
I think Liam was originally looking for justification to improve
article content in order for the article to attain higher page-views,
as he has his own private scientific evidence that higher page-views
result in a higher click-through rate (hopefully not with a sample
size of one museum?).
--
John Vandenberg
---------- Forwarded message ----------
From: Brian <Brian.Mingus(a)colorado.edu>
Date: Wed, Jun 2, 2010 at 10:46 PM
Subject: Re: [Wiki-research-l] Quality and pageviews
To: Liam Wyatt <liamwyatt(a)gmail.com>
Interestingly, the result is negative. The correlation coefficient between
2500 featured articles and 2500 random articles is 0.18, which is very low. I
also trained a linear classifier to predict the quality of an article based
on the number of page views and it was no better than chance.
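For concreteness, the correlation check can be sketched in a few lines of
Python with SciPy (scipy.stats.pointbiserialr is the textbook statistic for
a binary label against a continuous variable; the pageview arrays below are
placeholders for the counts the script actually scrapes):

    # Sketch: point-biserial correlation between a featured/random label
    # and monthly pageviews. The view counts are simulated placeholders.
    import numpy as np
    from scipy.stats import pointbiserialr

    rng = np.random.default_rng(0)
    featured_views = rng.lognormal(9, 2, 2500)  # placeholder counts
    random_views = rng.lognormal(7, 2, 2500)

    views = np.concatenate([featured_views, random_views])
    label = np.array([1] * 2500 + [0] * 2500)   # 1 = featured, 0 = random

    r, p = pointbiserialr(label, views)
    print("correlation coefficient: %.2f (p = %.3g)" % (r, p))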
This seems to make sense. People tend to spend a lot of time on articles
that are their pet topics, which, as we all know, is negatively correlated
with the historical importance of the topic, leading in many cases to better
Pokemon articles than we have on important historical figures. This result
could probably be fleshed out further by looking at the correlation between
the Wikipedia Editorial Team's ratings of quality and importance.
It turns out that this is easy to do visually. Here is their summary table.
Many of the most important articles are of low quality, although the
converse is not really true. Thus there is some median threshold of
importance a topic must get over before someone will consider it a good pet
project. That's my take on your question. Cheers.
All rated articles by quality and importance (each quality class and
importance level links to its Category page on en.wikipedia.org, e.g.
http://en.wikipedia.org/wiki/Category:FA-Class_articles):

Quality       Top      High     Mid       Low       ???        Total
FA            672      982      852       415       232        3,153
FL            101      393      332       401       168        1,395
A             84       265      262       122       17         750
GA            801      1,900    3,112     2,554     912        9,279
B             7,206    15,246   23,455    17,358    12,252     75,517
C             3,023    8,779    18,292    18,278    13,137     61,509
Start         7,803    37,581   144,561   248,669   169,614    608,228
Stub          2,536    19,020   122,322   638,946   791,594    1,574,418
List          1,284    3,537    8,739     20,067    19,039     52,666
Assessed      23,510   87,703   321,927   946,810   1,006,965  2,386,915
Unassessed    260      988      3,264     15,253    362,711    382,476
Total         23,770   88,691   325,191   962,063   1,369,676  2,769,391
On Wed, Jun 2, 2010 at 9:36 AM, Brian <Brian.Mingus(a)colorado.edu> wrote:
> Alright I will run it. It will probably be this evening before it is
> finished. I'll let you know the results.
>
>
> On Wed, Jun 2, 2010 at 9:30 AM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
>
>> I think you are greatly overestimating my technical abilities...
>> I don't even know where to find python and scipy and even if I did, I
>> wouldn't know what to do with them to make them "go". I'm a
>> historian-wikipedian, not a techie-wikipedian :-)
>>
>> Nevertheless, I will link out to your script in the blogpost I'm currently
>> writing for those who may wish to run it themselves.
>> All the best,
>> -Liam
>>
>> wittylama.com/blog
>> Peace, love & metadata
>>
>>
>> On 2 June 2010 16:24, Brian J Mingus <Brian.Mingus(a)colorado.edu> wrote:
>>
>>> Hey, all you have to do is install Python and SciPy, save the script to
>>> somefilename.py, and then run python somefilename.py.
>>>
>>>
>>> On Sat, May 29, 2010 at 5:17 PM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
>>>
>>>> Ok - your reasoning is sound. I'm happy that the "FA of the day" won't
>>>> distort the results :-) Thanks.
>>>>
>>>> On another note however, now is the first time I've been able to look at
>>>> your link (before I was on my phone) and I must admit I have no idea what to
>>>> do with it. I'm no coder so I don't know how to go from
>>>> http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Corre… to an actual result... Can you give me a hint?
>>>>
>>>>
>>>> -Liam
>>>>
>>>> wittylama.com/blog
>>>> Peace, love & metadata
>>>>
>>>>
>>>> On 29 May 2010 16:50, Brian J Mingus <Brian.Mingus(a)colorado.edu> wrote:
>>>>
>>>>> I think this technique is perfectly scientific. If you sample 2000
>>>>> featured and 2000 random articles there is only a 1.5% chance (30/2000)
>>>>> that a given featured article was on the front page last month. If you
>>>>> like, you can make a list and exclude them. But the effect is very small.
>>>>>
>>>>>
>>>>> On Sat, May 29, 2010 at 10:24 AM, Brian <Brian.Mingus(a)colorado.edu>wrote:
>>>>>
>>>>>> I only looked at last month so it is not sensitive to the article
>>>>>> being on the front page except by chance. Looking at Good articles is not a
>>>>>> good idea; you will get a very weak correlation.
>>>>>>
>>>>>>
>>>>>> On Sun, May 30, 2010 at 5:31 AM, Liam Wyatt <liamwyatt(a)gmail.com>wrote:
>>>>>>
>>>>>>> Actually Brian, does your tool take account of the unusual spike in
>>>>>>> pageviews that a featured article receives when it's FA of the day? I'd like
>>>>>>> to be able to say that the article receives more views even without special
>>>>>>> events like that. Is it possible to remove that day for each of the FAs
>>>>>>> from the stats? Or, alternatively, can the script be run looking for Good
>>>>>>> Articles rather than FAs as these don't get special publicity but are still
>>>>>>> better quality than the average?
>>>>>>>
>>>>>>> -Liam
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 29/05/2010, at 2:48, Brian J Mingus <Brian.Mingus(a)Colorado.EDU>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 28, 2010 at 6:51 PM, Brian <Brian.Mingus(a)colorado.edu> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 28, 2010 at 5:31 PM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Does anyone know of any research which demonstrates a correlation
>>>>>>>>> between the quality of an article in Wikipedia and the number of pageviews
>>>>>>>>> it receives?
>>>>>>>>> I'm trying to argue, as part of my work with museums, that if they
>>>>>>>>> want to get people to come to their own website from wikipedia, then instead
>>>>>>>>> of focusing on adding top-level links to the "external links" section, they
>>>>>>>>> should focus on adding deep-linked footnotes and encouraging the
>>>>>>>>> improvement of the quality of the article in general. This is on the basis
>>>>>>>>> that "increased quality=increased pageviews=increased clickthroughs". A
>>>>>>>>> win-win situation.
>>>>>>>>>
>>>>>>>>> I have some very nice statistics of pageviews (from
>>>>>>>>> stats.grok.se) that match the clickthrough
>>>>>>>>> stats from a museum (from their analytics) that clearly demonstrate a
>>>>>>>>> correlation between increased pageviews and increased clickthroughs.
>>>>>>>>> However, I'm still looking for some scientific data to prove the link
>>>>>>>>> between improved article quality and an increase in the number of views of
>>>>>>>>> that article.
>>>>>>>>>
>>>>>>>>> Can anyone help?
>>>>>>>>>
>>>>>>>>> -Liam
>>>>>>>>
>>>>>>>>
>>>>>>>> An easy way to pull this off looks like this (see the sketch after this list):
>>>>>>>>
>>>>>>>> - Create a list of all the featured articles
>>>>>>>> - Use http://stats.grok.se to count their pageviews
>>>>>>>> - Generate a random list of articles
>>>>>>>> - Check their pageviews
>>>>>>>> - Compute correlation between pageviews / quality
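(A rough sketch of that recipe in Python. The stats.grok.se JSON endpoint
shown is an assumption about that era's API, and the article lists are
placeholders; a real run would use the full featured list and a proper
random sample:)

    # Sketch of the recipe above. Assumes stats.grok.se served JSON at
    # /json/<lang>/<YYYYMM>/<article>; article lists are placeholders.
    import json
    import urllib.request

    import numpy as np

    def monthly_views(article, month="201004"):
        url = "http://stats.grok.se/json/en/%s/%s" % (month, article)
        with urllib.request.urlopen(url) as f:
            return sum(json.load(f)["daily_views"].values())

    featured = ["Sun", "Moon"]        # placeholder featured titles
    randoms = ["Example", "Sandbox"]  # placeholder random titles

    views = [monthly_views(a) for a in featured + randoms]
    label = [1] * len(featured) + [0] * len(randoms)
    print("correlation:", np.corrcoef(label, views)[0, 1])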
>>>>>>>>
>>>>>>>> You can also do a slightly more detailed analysis by using the
>>>>>>>> Wikipedia Editorial Team's ratings, but caveat emptor, their rating system
>>>>>>>> is not satisfactory as I've shown in a couple of papers. Nonetheless, if
>>>>>>>> you group together Featured/A/B articles and Start/Stub articles into two
>>>>>>>> groups, the result should be solid.
>>>>>>>>
>>>>>>>>
>>>>>>>> - DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving
>>>>>>>> search in Wikipedia through quality and concept discovery*.
>>>>>>>> Technical Report. PDF<http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalat…>
>>>>>>>> - Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
>>>>>>>> feasibility of automatically rating online article quality*.
>>>>>>>> Technical Report. PDF<http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincock…>
>>>>>>>>
>>>>>>>>
>>>>>>> Liam,
>>>>>>>
>>>>>>> I've written you a quick bot that computes the correlation
>>>>>>> coefficient between quality and pageviews for random samples of featured and
>>>>>>> random articles in April. It requires Python and SciPy. You can find it
>>>>>>> here:
>>>>>>>
>>>>>>>
>>>>>>> http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Corre…
>>>>>>>
>>>>>>> I suggest that you cross-validate the result. In other words, run the
>>>>>>> script several times and then average the results. It's rather slow, owing
>>>>>>> to the slowness of Wikimedia properties. I personally didn't run it for more
>>>>>>> than 10 featured and 10 random articles, but I set those numbers to 2000 for
>>>>>>> you. It will take hours to run.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Brian