Re: [Wiki-research-l] Project exploring automated classification of article importance

28 Apr 2017

Re: initial setting of Importance in project tags.

I don't offer any evidence for my claim that initial project tagging often gets
importance correct; it's just my observation that this is so. Since importance is
about the topic rather than the article, I suspect it can be reliably assigned even to a
stub article.

However, if we looked at a project which is known to be diligent in its tagging
(your collaboration with WikiProject Medicine might have this data), I would still be very
interested to compare the start dates of the articles with their current importance,
to test my hypothesis that the more important an article is, the more likely it is to have
started earlier.
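
One way this comparison might be run is a rank correlation between creation date and
importance rating. The following is only a minimal sketch under stated assumptions: the
article records are made-up placeholders (not WikiProject Medicine data), the Low/Mid/High/Top
ordering is treated as an ordinal scale, and Spearman's rho is just one reasonable choice of
statistic. A clearly negative rho would be consistent with the hypothesis that more
important articles were started earlier.

```python
from datetime import date

# Ordinal scale for WikiProject importance ratings (higher = more important).
IMPORTANCE_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up placeholder records: (title, creation date, current importance).
articles = [
    ("article_a", date(2001, 5, 1), "Top"),
    ("article_b", date(2003, 2, 1), "High"),
    ("article_c", date(2006, 9, 1), "Mid"),
    ("article_d", date(2011, 4, 1), "Low"),
    ("article_e", date(2015, 7, 1), "Low"),
]

created = [d.toordinal() for _, d, _ in articles]
importance = [IMPORTANCE_RANK[label] for _, _, label in articles]
rho = spearman(created, importance)
print(f"Spearman rho between creation date and importance: {rho:.2f}")
```

With real data, the same calculation could be repeated per WikiProject, since importance
is rated relative to each project.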

And for those articles which have had their importance raised over time, did the articles
have increasing pageviews, either in a sustained way or as a series of upward spikes (which
might suggest a growing real-world interest in the topic), between the initial tagging and
the re-tagging? And for those articles which had their importance reduced over time, did
it correspond to diminishing levels of pageviews, suggesting declining real-world interest?
(I presume downward spikes are an unlikely phenomenon, although that would be an interesting
question to ask across Wikipedia generally to confirm my theory that they don't occur.)
Or to put it another way, did the re-assignment of article importance
reflect the topic's changing importance in the real world (for which I think
pageviews are the best proxy) or, if it occurred in a way apparently unrelated to real-world
interest, is it a case of the original tagging simply being "wrong"? Obviously
some project taggers may be more knowledgeable about the topic space than others, while some
taggers may have POV or COI reasons for overstating/understating a topic's importance.
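
To make the distinction between "spikes" and "sustained" change concrete, here is a rough
sketch of how the two might be separated in a daily pageview series covering the period
between tagging and re-tagging. Everything in it is an assumption for illustration: the
counts are made up, and the spike rule (a day exceeding three times the median of the
preceding week) and the half-versus-half comparison are arbitrary choices, not established
definitions.

```python
from statistics import median


def find_spikes(daily_views, window=7, factor=3.0):
    """Return indices of days whose views exceed `factor` times the median
    of the preceding `window` days (an assumed, arbitrary spike definition)."""
    spikes = []
    for i in range(window, len(daily_views)):
        baseline = median(daily_views[i - window:i])
        if baseline > 0 and daily_views[i] > factor * baseline:
            spikes.append(i)
    return spikes


def sustained_change(daily_views):
    """Ratio of the median of the second half to the median of the first half.
    Values well above 1 suggest sustained growth; well below 1, decline."""
    half = len(daily_views) // 2
    first = median(daily_views[:half])
    second = median(daily_views[half:])
    return second / first if first > 0 else float("inf")


# Made-up daily pageview counts with a short burst in the middle.
views = [100, 110, 95, 105, 100, 98, 102, 900, 400, 120, 115, 110, 118, 112]

print("spike days:", find_spikes(views))
print("second-half / first-half median:", sustained_change(views))
```

Using medians rather than means keeps a short burst of attention from being mistaken for
sustained growth.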

So many interesting questions, inquiring minds want to know ....

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of
Morten Wang
Sent: Friday, 28 April 2017 10:48 AM
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article
importance

Thanks for the thoughtful comments, Kerry! There were many great points in your email,
and I'd like to focus on some of them.

Your likening of viewership to readers and inlinks to writers echoes how we think about
this as well. I agree that these two groups differ on many characteristics, something that
both the contributor surveys you mention and published research show. For example, West et
al.'s 2012 paper (see citations below) looks at how browsing history shows differing
interests between readers and
contributors, and the "WP:Clubhouse" paper (Lam et al., 2011) starts getting at
how the gender proportions differ (there are of course other papers as well; these were
the first that came to mind). By combining both, we get more signal.

This also touches on the discussion of how popularity is related to importance, and
whether importance changes over time. The article about Drottninggatan in Stockholm is but
one example of an article that becomes the center of attention due to a breaking news
event. We did an analysis of a dataset of very popular articles in our 2015 ICWSM paper,
finding that about half of them show this kind of transient behaviour. In that paper we
argue that the more popular articles are more important and should have higher quality,
which means that it's partly chasing a moving target and partly a focused effort on
the long-term important content (of which pasteurization is probably one example). For
some topics it is easier to predict their shifts in importance because they are seasonal,
e.g. Christmas, Easter, or sporting events like world championships. When it comes to others it
might be harder, e.g. Trump, or Google Flu Trends, which I recently came across. How
important is the latter article now that the website is no longer available?

When it comes to links, you point out that they are not all equal. This is something
we're incorporating in our work. Currently we have a model for WikiProject Medicine,
and it accounts for both inlinks from all across English Wikipedia, as well as to what
extent they come from other articles tagged by the project. We also use the clickstream
dataset to add information about whether an article's traffic comes from other
Wikipedia articles, meaning it is useful as supporting information for those, or whether
it comes from elsewhere. Lastly, we use the clickstream dataset to get an idea about how
many inlinks to an article are actually used. As you write, the links in the lede are more
important, something at least one research paper points to (Dimitrov et al, 2016), and
something the clickstream dataset allows us to estimate. I think it's great to see
these ideas pop up in the discussion and be able to show how we're incorporating these
into what we're doing and that they affect our results.

As I wrap up, I would like to challenge the assertion that initial importance ratings are
"pretty accurate". I'm not sure we really know that. They might be, but it might be because
the vast majority of them are newly created stubs
that get rated "low importance". More interesting are perhaps other types of
articles, where I suspect that importance ratings are copied from one WikiProject template
to another, and one could argue that they need updating. Our collaboration with
WikiProject Medicine has resulted in updated ratings of a couple of hundred or so articles
so far, although most of them were corrections that increase consistency in the ratings.
As I continue working on this project I hope to expand our collaborations to other
WikiProjects, and I'm looking forward to seeing how well we fare with those!

Citations:
West, R.; Weber, I.; and Castillo, C. 2012. Drawing a Data-driven Portrait of Wikipedia
Editors. In Proc. of OpenSym/WikiSym, 3:1–3:10.

Lam, S. T. K.; Uduwage, A.; Dong, Z.; Sen, S.; Musicant, D. R.; Terveen, L.; and Riedl, J.
2011. WP:Clubhouse?: An Exploration of Wikipedia's Gender Imbalance. In Proc. of
WikiSym, 1–10.

Warncke-Wang, M.; Ranjan, V.; Terveen, L.; and Hecht, B. 2015. Misalignment Between Supply
and Demand of Quality Content in Peer Production Communities. In Proc. of ICWSM.

Dimitrov, D.; Singer, P.; Lemmerich, F.; and Strohmaier, M. 2016. Visual Positions of Links
and Clicks on Wikipedia. In Proc. of the 25th International Conference Companion on WWW,
27–28.

Cheers,
Morten

On 25 April 2017 at 20:39, Kerry Raymond <kerry.raymond(a)gmail.com> wrote:

...
 Just a few musings on the issue of Importance and how to research it ...

 I agree it is intuitive that importance is likely to be linked to 
 pageviews and inbound links but, as the preliminary experiment showed, 
 it's probably not that simple.

 Pageviews tells us something about importance to readers of Wikipedia, 
 while inbound links tells us something about importance to writers of 
 Wikipedia, and I suspect that writers are not a proxy for readers as 
 the editor surveys suggest that Wikipedia writers are not typical of 
 broader society on at least two variables: gender and level of 
 education (might be others, I can't remember).

 But I think importance is a relative metric rather than an absolute one. I
 think that, by taking the mean value of importance across a number of
 WikiProjects, the preliminary experiment may have lost something
 because it tried (through averaging) to look at importance
 "generally". I suspect that an experiment considering only
 the importance ratings with respect to a single WikiProject would be more
 likely to show correlation with pageviews (relative to other articles in
 that same WikiProject) and inbound links. And I think there are two
 kinds of inbound links to be considered: those coming from other
 articles within the same WikiProject and those coming from outside
 that WikiProject. I suspect different insights will be obtained by
 looking at both types of inbound links separately rather than treating
 them as an aggregate. I note also that WikiProjects are not entirely
 independent of one another but have relationships between them. For
 example, WikiProject Australian Roads describes itself as an
 "intersection" (ha ha!) of WikiProject Highways and WikiProject
 Australia, so I expect that we would find greater correlation in importance between
 related WikiProjects than between unrelated WikiProjects.
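
 Counting the two kinds of inbound links separately could look something like the sketch
 below. It is purely illustrative: the link pairs and project membership are made-up
 placeholders, and `split_inlinks` is a hypothetical helper, not part of any existing tool
 or of the model being discussed.

```python
def split_inlinks(target, links, project_members):
    """Count inbound links to `target`, separating sources tagged by the
    same WikiProject from all other sources. Self-links are ignored."""
    inside = outside = 0
    for source, destination in links:
        if destination != target or source == target:
            continue
        if source in project_members:
            inside += 1
        else:
            outside += 1
    return {"within_project": inside, "outside_project": outside}


# Made-up placeholder link graph as (source, destination) pairs.
links = [
    ("road_a", "highway_x"),
    ("road_b", "highway_x"),
    ("town_c", "highway_x"),
    ("highway_x", "road_a"),
]
# Made-up set of articles tagged by one hypothetical WikiProject.
project_members = {"road_a", "road_b", "highway_x"}

counts = split_inlinks("highway_x", links, project_members)
print(counts)
```

 The two counts can then be used as separate features rather than one aggregate inlink
 count.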

 When thinking about readers and pageviews, I think we have to ask 
 ourselves is there a difference between popularity and importance. Or 
 whether popularity *is* importance. I sense that, as a group of 
 educated people, those of us reading this research mailing list 
 probably do think there is a difference. Certainly if there is no 
 difference, then this research can stop now -- just judge importance 
 by pageviews. Let's assume a difference then. When looking at
 pageviews of an article, they are not always consistent over time. 
 Here are the pageviews for Drottninggatan

 https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan

 Why so interesting on 8 April? A terrorist attack occurred there. This 
 spike in pageviews occurs all the time when some topic is in the news 
 (even peripherally as in this case where it is not the article about 
 the terrorist attack but about the street in which it occurred). Did 
 the street become more "important"? I think it became more interesting 
 but not more important. So I think we do have to be careful to 
 understand that pageviews probably reflect interest rather than 
 importance. I note that The Chainsmokers (a music group with a number
 of songs in the current USA music charts) gets many more pageviews than the
 Wikipedia article on Pasteurization, but The Chainsmokers are not rated as being
 of high importance by the relevant WikiProjects while Pasteurization
 is very important in WikiProject Food and Drink. Since pasteurisation
 prevents a lot of deaths, I think we might agree that in the real
 world pasteurisation is more important than a music group regardless of what
 pageviews tell us.

 https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization

 Of course it matters for Wikipedia's success that our *popular*
 articles are of high quality, but I think we have to be cautious about
 pageviews being a proxy for importance.

 When we look at Wikipedia writers' decisions in tagging the importance 
 of articles to WikiProjects, what do we find? As we know, project tags 
 are often placed on new articles (and often not subsequently 
 reviewed). So while I find that quality tags are often out-of-date, 
 the importance seems to be pretty accurate even on new stub
 articles. This is because it is the importance of the *topic* that is
 being assessed which is independent of the Wikipedia article itself. 
 Provided the article is clear enough about what it is about and why it 
 matters (which is the traditional content of that first paragraph or 
 two and failing to provide it will likely result in speedy deletion of 
 the new article), assessment of the topic's importance can be made 
 even at new stub level. This tells us that importance for Wikipedia 
 writers is determined by something outside of Wikipedia (probably 
 their real-world knowledge of that topic space -- one assumes that 
 project taggers are quite interested in the topic space of that 
 project). While article quality hopefully improves over time, I would 
 be surprised if article importance greatly changed over time. 
 Obviously there are counter-examples. I am guessing Donald Trump's article may have
 grown in importance over time, but that's probably because his lede para changed.
 Adding President of the USA into the lede paragraph makes him much 
 more important than he was before in the real world and internal to 
 Wikipedia he has acquired an inbound link from the presumably 
 high-importance President of the USA article. So I think it might be 
 interesting to study those articles whose importance does change over 
 time to see if there are any strong correlations with what is happening to the article
 inside Wikipedia. I think this set of importance-changing articles may be where we
 really learn what Wikipedia article characteristics are strongly 
 correlated to "importance" given that importance itself appears to be 
 pretty stable for most articles.

 Although not stated explicitly, I imagine we believe that generally 
 less important articles tend to link to more important articles but 
 more important articles don't link to less important articles. And 
 hence in-bound links are likely to matter in assessing importance and 
 that in-bound links from "important" articles are more valuable than 
 in-bound links from less important articles (which creates something
 of a bootstrapping problem), similar to the issue with Google's PageRank
 algorithm. But I think we do have some information that Google
 doesn't have. The average webpage does not have a lede paragraph that 
 situates the topic relative to other topics; a Wikipedia article does. 
 If I have to choose to define Thing X in terms of Thing Y, it tends to 
 suggest that Y is more important than X. If Y also defines itself in 
 terms of X, then it tends to suggest they are equivalent in importance 
 in some way. Indeed I suspect when we get to the VERY IMPORTANT topics
 we will see this kind of circular definition (e.g. you see circular 
 definitions in Wikipedia around Philosophy and Knowledge). As an aside, if
 you have never done this before, try this experiment. Choose a random
 article (left-hand toolbar in Desktop Wikipedia), then click the first link in the
 article that matters (i.e. ignore links in hatnotes or links inside parentheses).
 Repeat this first-link clicking and sooner or later you will reach articles like
 Knowledge and Philosophy, which all sit inside circular definition groups.
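
 The first-link experiment can be sketched in code as a walk that stops when an article
 repeats. This is only a toy illustration: the `first_link` mapping below is made up with
 placeholder titles and does not reflect the real first links of any Wikipedia article. A
 real version would need to fetch each page and skip hatnotes and parenthesised links,
 which is not attempted here.

```python
def follow_first_links(start, first_link):
    """Follow first links from `start` until an article repeats.

    Returns (path, cycle): the articles visited in order, and the circular
    group that the walk ends in (empty if the walk hits a dead end).
    """
    path, seen = [], {}
    current = start
    while current is not None and current not in seen:
        seen[current] = len(path)
        path.append(current)
        current = first_link.get(current)
    # If we stopped on a repeat, the cycle is everything from its first visit.
    cycle = path[seen[current]:] if current is not None else []
    return path, cycle


# Made-up mapping from an article to the first link in its lede.
first_link = {
    "stub": "town",
    "town": "region",
    "region": "concept_x",
    "concept_x": "concept_y",
    "concept_y": "concept_x",
}

path, cycle = follow_first_links("stub", first_link)
print("path:", path)
print("circular group:", cycle)
```

 Articles that many such walks end in would be candidates for the "circular definition
 groups" described above.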

 If we look at the Donald Trump article, his first sentence contains 
 only two links, one to List of Presidents of the USA and the other to 
 President of the USA. If we look at those two articles, we find
 that both of them mention Donald Trump in their lede paras (although
 not as early as the first sentence) and before mentions of any other
 US President elsewhere in the article. This is consistent with what
 we know about the real world: the role of the President is more
 important than its officeholders, and the current officeholder has
 more importance than a past officeholder. So topic importance does seem to be skewed
 towards the "present day".

 So I suspect the links in the lede paras are of greater relevance to
 the assessment of importance than links further down in the article,
 which are more likely to relate to details of a topic and may include
 examples and counter-examples (this is a way in which a high-importance
 article may mention much lower-importance articles). However, we do
 have to be a little bit careful here because of the MoS practice of
 not linking very common terms. For example, an Australian article will
 often refer to Australia in the lede para but it will almost certainly
 not be linked to the Australia article (and any attempt to add such a
 link will likely see it removed with an edit summary that mentions
 [[WP:Overlinking]]), whereas there is no problem if you link to an Australian state
 article, e.g. New South Wales.
 So we might find that some very important topics that often appear in
 ledes get fewer links than you might expect because of the MoS
 policies on overlinking, which may be a problem when working with
 inbound links. It may be that for "very common topics" the presence of
 the article title (or its synonyms) in the lede may have to be considered as if it were an
 in-bound link for statistical research purposes.

 Given all of the above, perhaps the most interesting group of articles 
 to study in Wikipedia are those articles whose manually-assessed 
 importance has changed over the life of the article AND which were NOT 
 current topics in the lifetime of Wikipedia (given the influence of 
 "current" on importance). But having said that, I wonder if that group 
 of articles actually exists. Recently a newish Australian contributor
 expressed disappointment that all the new articles they had created
 were tagged (by others) as of Low Importance. My instinctive reply was "that's normal;
 I think of the thousands of articles I have started only a couple are even
 rated as Mid importance. This is because the really important articles
 were all started long ago precisely because they were important". I
 suspect topics that are very important (for reasons other than
 short-lived importance due to being "current" in the lifetime of
 Wikipedia) will generally show up as having started early in
 Wikipedia's life and that those that become more/less important over
 time will be largely linked to becoming or ceasing to be "current"
 topics. E.g. the article Pasteurization started in May 2001 saying
 nothing more than "Pasteurization is the process of killing off
 bacteria in milk by quickly heating it to a near boiling temperature,
 then quickly cooling it again before the taste and other desirable
 properties are affected. The process was named after its inventor,
 French scientist Louis Pasteur. See also dairy products." The links in
 this very first version are still present in its lede paragraph today,
 suggesting our understanding of "non-current" topics is stable and hence
 initial importance determinations can probably be accurately made.
 For Pasteurization the Talk page shows it was not project-tagged until 
 2007 when it was assigned High Importance as its first assessment.

 I suspect we will find that initial manual assessment of article 
 importance will be pretty accurate for most articles. And I suspect if 
 we plot initial importance assessments against time of assessment, we 
 will find the higher importance articles commenced life on Wikipedia 
 earlier than the lower importance articles. If I am correct, then 
 there isn't a lot of value in machine-assessment of importance of 
 topics because it relates to factors external to Wikipedia and often 
 does not change over time and therefore can often be correctly 
 assessed manually even on new stub articles (and any unassessed 
 articles can probably be rated as Low Importance as statistically that's almost
 certainly going to be correct).
 If a topic becomes more important due to "current" events, then 
 invariably that article will be updated by many people and one of them 
 will sooner or later manually adjust its importance. What is less 
 likely to happen is re-assessing downwards of Importance when an 
 important "current" topic loses its importance when it is no longer 
 current, e.g. are former American presidents like Barack Obama or 
 George W Bush or further back less important now? These articles will 
 not be updated frequently once the topic is no longer in the news and 
 therefore it is less likely an editor will notice and manually 
 downgrade the importance, so there may be a greater role for 
 machine-assessment in downgrading importance rather than upgrading importance.

 Another area where there might be a role for machine-assessed
 importance is in regard to POV-pushing, where a POV-motivated editor
 might change the manually assessed importance of articles to be higher
 or lower based on their POV (e.g. my political party is Top
 Importance, other parties are of Low Importance). I suspect that often
 a page watcher would correct or at least question that kind of
 re-assessment. However, on articles with few active pagewatchers you
 might get away with POV-pushing the article's importance tag because
 nobody notices. In this situation, a machine assessment could be useful in spotting this
 kind of thing.

 This suggests that another metric of interest to importance might be 
 number of pagewatchers, although I suspect that pagewatching may 
 relate more to caring about the article than to caring about the 
 topic. And one has to be careful to distinguish active pagewatchers 
 (those who actually do review changes on their watchlists) from those 
 who don't, as that may make a difference (although I am not sure we
 can really tell which pagewatchers are truly actively reviewing, as a
 "satisfactory review" doesn't leave a trace whereas an
 "unsatisfactory" review is likely to lead fairly soon to a revert
 or some other change to the article, the article Talk or the User Talk of the reviewed
 contributor, which may be detectable).

 The other aspect of articles that occurs to me as being possibly 
 linked to importance of the topic would be use of the article as the 
 "main" article for a category or as the title of a navbox (as it 
 suggests that the articles in the category or navbox are in some way 
 subordinate to the main/title article). Similarly for list articles,
 the "type" of the list is often more important than its instances.

 Kerry

 -----Original Message-----
 From: Wiki-research-l 
 [mailto:wiki-research-l-bounces@lists.wikimedia.org]
 On Behalf Of Morten Wang
 Sent: Friday, 21 April 2017 6:04 AM
 To: Research into Wikimedia content and communities
 <wiki-research-l(a)lists.wikimedia.org>
 Subject: Re: [Wiki-research-l] Project exploring automated 
 classification of article importance

 Hi Pine,

 These are great pointers to existing practices on enwiki, some of 
 which I've been looking for and/or missed, thanks!

 Cheers,
 Morten

 On 19 April 2017 at 22:35, Pine W <wiki.pine(a)gmail.com> wrote:

  Hi Nettrom,

 A few resources from English Wikipedia regarding article importance 
 as ranked by humans:

 https://en.wikipedia.org/wiki/Wikipedia:Vital_articles

 https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic

 https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics

 I infer from the ENWP Wikicup's scoring protocol that for purposes 
 of the competition, an article's "importance" is loosely inferred 
 from the number of language editions of Wikipedia in which the 
 article appears:

https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.

 HTH,

 Pine

 On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <nettrom(a)gmail.com> wrote:

  Hello everyone,

 I am currently working with Aaron Halfaker and Dario Taraborelli 
 at the Wikimedia Foundation on a project exploring automated 
 classification of article importance. Our goal is to characterize 
 the importance of an article within a given context and design a 
 system to predict a relative importance rank. We have a project
 page on meta[1] and welcome comments or thoughts on our talk page.
 You can of course also respond here on wiki-research-l, or send me an email.

 Before moving on to model-building I did a fairly thorough
 literature review, finding a myriad of papers spanning several
 disciplines. We have a draft literature review also up on meta[2], which
 should give you a reasonable introduction to the topic. Again, comments or thoughts (e.g.
 papers we've missed) on the talk page, mailing list, or through
 email are welcome.

 Links:

    1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
    2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance

 Regards,
 Morten
 [[User:Nettrom]] aka [[User:SuggestBot]] 
 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
