Thanks for the thoughtful comments, Kerry! There were many great points in
your email, I'd like to focus on some of them.
Your likening of viewership to readers and inlinks to writers echoes how we
think about this as well. I agree that these two groups differ on many
characteristics, something shown both by the contributor surveys you mention
and by the research literature. For example, West et al.'s 2012 paper (see
citation below) uses browsing history to show differing interests between
readers and contributors, and the "WP:Clubhouse" paper (Lam et al., 2011)
starts getting at how the gender proportions differ (there are of course
other papers as well, these were the first that came to mind). By combining
both, we get more signal.
This also touches on the discussion of how popularity is related to
importance, and whether importance changes over time. The article about
Drottninggatan in Stockholm is but one example of an article that becomes
the center of attention due to a breaking news event. We did an analysis of
a dataset of very popular articles in our 2015 ICWSM paper, finding that
about half of them show this kind of transient behaviour. In that paper we
argue that the more popular articles are more important and should have
higher quality, which means that it's partly chasing a moving target and
partly a focused effort on the long-term important content (of which
pasteurization is probably one example). For some topics it is easier to
predict their shifts in importance because they are seasonal, e.g.
Christmas, Easter, or sporting events like world championships. For others
it might be harder, e.g. Trump, or Google Flu Trends, which
I recently came across. How important is the latter article now that the
website is no longer available?
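The transient behaviour mentioned above can be illustrated with a small sketch. This is not the method from the ICWSM 2015 paper; the data and the "peak day dwarfs a typical day" threshold are made up for illustration.

```python
# Hypothetical sketch: flag articles whose popularity is transient, i.e. a
# short-lived spike dominates an otherwise modest pageview time series.
# The peak_ratio threshold is illustrative, not taken from the paper.

from statistics import median

def is_transient(daily_views, peak_ratio=10.0):
    """Return True if the busiest day dwarfs a typical (median) day."""
    typical = median(daily_views)
    return typical > 0 and max(daily_views) / typical >= peak_ratio

# A breaking-news spike vs. steadily popular content (made-up numbers):
spiky  = [120, 130, 110, 9000, 800, 150, 125]   # e.g. a street in the news
steady = [5000, 5200, 4900, 5100, 5050, 4980, 5120]

print(is_transient(spiky))   # True
print(is_transient(steady))  # False
```

A real analysis would of course work with much longer time series from the pageview dumps, but the shape of the question is the same.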
When it comes to links, you point out that they are not all equal. This is
something we're incorporating in our work. Currently we have a model for
WikiProject Medicine, and it accounts both for inlinks from all across
English Wikipedia and for the extent to which those inlinks come from other
articles tagged by the project. We also use the clickstream dataset to add
information about whether an article's traffic comes from other Wikipedia
articles, suggesting it serves as supporting information for them, or
whether it comes from elsewhere. Lastly, we use the clickstream dataset to
get an idea about how many inlinks to an article are actually used. As you
write, the links in the lede are more important, something at least one
research paper points to (Dimitrov et al, 2016), and something the
clickstream dataset allows us to estimate. I think it's great to see these
ideas pop up in the discussion and be able to show how we're incorporating
these into what we're doing and that they affect our results.
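To make the clickstream features above concrete, here is a minimal sketch using the public clickstream schema (prev, curr, type, n). The rows, article names, and inlink lists below are made up; the real dataset is a large monthly TSV dump.

```python
# Sketch of two clickstream-derived features: the share of an article's
# traffic arriving via internal links, and the fraction of its known
# inlinks that readers actually used at least once. Toy data throughout.

def clickstream_features(article, inlinks, rows):
    """rows: (prev, curr, type, n) tuples. Returns (internal_share,
    fraction of `inlinks` observed carrying traffic)."""
    internal = external = 0
    used_inlinks = set()
    for prev, curr, typ, n in rows:
        if curr != article:
            continue
        if typ == "link":          # arrived via an internal wiki link
            internal += n
            used_inlinks.add(prev)
        else:                      # search engines, external sites, etc.
            external += n
    total = internal + external
    internal_share = internal / total if total else 0.0
    used = len(used_inlinks & set(inlinks)) / len(inlinks) if inlinks else 0.0
    return internal_share, used

rows = [
    ("Louis_Pasteur", "Pasteurization", "link", 300),
    ("Milk", "Pasteurization", "link", 200),
    ("other-search", "Pasteurization", "external", 500),
]
share, used = clickstream_features(
    "Pasteurization", ["Louis_Pasteur", "Milk", "Dairy_product"], rows)
print(share, used)  # 0.5 of traffic is internal; 2 of 3 inlinks were used
```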
As I wrap up, I would like to challenge the assertion that initial
importance ratings are "pretty accurate". I'm not sure we really know that.
They might be, but it might be because the vast majority of them are newly
created stubs that get rated "low importance". More interesting are perhaps
other types of articles, where I suspect that importance ratings are copied
from one WikiProject template to another, and one could argue that they
need updating. Our collaboration with WikiProject Medicine has resulted in
updated ratings for a couple of hundred articles so far, although most of
the changes were corrections that increase consistency in the ratings. As I
continue working on this project I hope to expand our collaborations to
other WikiProjects, and I'm looking forward to seeing how well we fare with
those!
Citations:
West, R.; Weber, I.; and Castillo, C. 2012. Drawing a Data-driven Portrait
of Wikipedia Editors. In Proc. of OpenSym/WikiSym, 3:1–3:10.
Lam, S. T. K.; Uduwage, A.; Dong, Z.; Sen, S.; Musicant, D. R.; Terveen,
L.; and Riedl, J. 2011. WP:Clubhouse?: An Exploration of Wikipedia's Gender
Imbalance. In Proc. of WikiSym, 1–10.
Warncke-Wang, M.; Ranjan, V.; Terveen, L.; and Hecht, B. 2015. Misalignment
Between Supply and Demand of Quality Content in Peer Production
Communities. In Proc. of ICWSM.
Dimitrov, D.; Singer, P.; Lemmerich, F.; and Strohmaier, M. 2016. Visual
Positions of Links and Clicks on Wikipedia. In Proc. of the 25th
International Conference Companion on WWW, 27–28.
Cheers,
Morten
On 25 April 2017 at 20:39, Kerry Raymond <kerry.raymond(a)gmail.com> wrote:
Just a few musings on the issue of Importance and how
to research it ...
I agree it is intuitive that importance is likely to be linked to
pageviews and inbound links but, as the preliminary experiment showed, it's
probably not that simple.
Pageviews tell us something about importance to readers of Wikipedia,
while inbound links tell us something about importance to writers of
Wikipedia, and I suspect that writers are not a proxy for readers, as the
editor surveys suggest that Wikipedia writers are not typical of broader
society on at least two variables: gender and level of education (there
might be others, I can't remember).
But I think importance is a relative metric rather than an absolute one. I
think taking the mean value of importance across a number of WikiProjects
in the preliminary experiment may have lost something because it tried
(through averaging) to look at importance "generally". I suspect an
experiment considering only the importance ratings with respect to a
single WikiProject would be more likely to show correlation with pageviews
(relative to other articles in that same WikiProject) and inbound links. And I
think there are two kinds of inbound links to be considered, those coming
from other articles within the same WikiProject and those coming from
outside that Wikiproject. I suspect different insights will be obtained by
looking at both types of inbound links separately rather than treating them
as an aggregate. I note also that WikiProjects are not entirely independent
of one another but have relationships between them. For example, The
WikiProject Australian Roads describes itself as an "intersection" (ha ha!)
of WikiProject Highways and WikiProject Australia, so I expect that we
would find greater correlation in importance between related WikiProjects
than between unrelated WikiProjects.
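The per-WikiProject analysis suggested above could be sketched as a rank correlation between manual importance ratings and pageviews within one project. The data and the bare-bones correlation helper below are toy stand-ins (no tie handling); a real analysis would use something like scipy.stats.spearmanr.

```python
# Toy sketch: rank-correlate importance (Top=4 .. Low=1) with pageviews
# for articles within a single WikiProject. The helper computes Spearman
# rho as Pearson correlation of the ranks, with no tie handling.

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Made-up articles from one hypothetical WikiProject:
importance = [4, 3, 3, 2, 1, 1]
pageviews  = [90000, 55000, 40000, 8000, 1200, 3000]
print(round(spearman(importance, pageviews), 2))  # → 0.94
```

A high within-project correlation like this would support the idea that importance is relative to a project's own article population rather than "general".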
When thinking about readers and pageviews, I think we have to ask
ourselves is there a difference between popularity and importance. Or
whether popularity *is* importance. I sense that, as a group of educated
people, those of us reading this research mailing list probably do think
there is a difference. Certainly if there is no difference, then this
research can stop now -- just judge importance by pageviews. Let's assume
a difference then. When looking at pageviews of an article, they are not
always consistent over time. Here are the pageviews for Drottninggatan
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
Why so interesting on 8 April? A terrorist attack occurred there. This
spike in pageviews occurs all the time when some topic is in the news (even
peripherally as in this case where it is not the article about the
terrorist attack but about the street in which it occurred). Did the street
become more "important"? I think it became more interesting but not more
important. So I think we do have to be careful to understand that pageviews
probably reflect interest rather than importance. I note that The
Chainsmokers (a music group with a number of songs in the current USA music
charts) gets many more Wikipedia article pageviews than the Wikipedia
article on Pasteurization but The Chainsmokers are not rated as being of
high importance by the relevant WikiProjects while Pasteurization is very
important in WikiProject Food and Drink. Since pasteurisation prevents a
lot of deaths, I think we might agree that in the real world pasteurisation
is more important than a music group regardless of what pageviews tell us.
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
Of course it matters for Wikipedia's success that our *popular*
articles are of high quality, but I think we have to be cautious about
treating pageviews as a proxy for importance.
When we look at Wikipedia writers' decisions in tagging the importance of
articles to WikiProjects, what do we find? As we know, project tags are
often placed on new articles (and often not subsequently reviewed). So
while I find that quality tags are often out-of-date, the importance seems
to be pretty accurate even on new stub articles. This is because it is
the importance of the *topic* that is being assessed which is independent
of the Wikipedia article itself. Provided the article is clear enough about
what it is about and why it matters (which is the traditional content of
that first paragraph or two and failing to provide it will likely result in
speedy deletion of the new article), assessment of the topic's importance
can be made even at new stub level. This tells us that importance for
Wikipedia writers is determined by something outside of Wikipedia (probably
their real-world knowledge of that topic space -- one assumes that project
taggers are quite interested in the topic space of that project). While
article quality hopefully improves over time, I would be surprised if
article importance greatly changed over time. Obviously there are
counter-examples. I am guessing Donald Trump's article may have grown in
importance over time, but that's probably because his lede para changed.
Adding President of the USA to the lede paragraph makes him much more
important than he was before in the real world, and internal to Wikipedia
he has acquired an inbound link from the presumably high-importance
President of the USA article. So I think it might be interesting to study those
articles whose importance does change over time to see if there are any
strong correlations with what is happening to the article inside Wikipedia.
I think this set of importance-changing articles may be where we
really learn what Wikipedia article characteristics are strongly correlated
with "importance", given that importance itself appears to be pretty stable
for most articles.
Although not stated explicitly, I imagine we believe that generally less
important articles tend to link to more important articles but more
important articles don't link to less important articles. And hence
in-bound links are likely to matter in assessing importance, and that
in-bound links from "important" articles are more valuable than in-bound
links from less important articles (which creates something of a
bootstrapping problem), similar to the issue with Google's PageRank
algorithm. But I think we do have some information that Google doesn't
have. The average webpage does not have a lede paragraph that situates the
topic relative to other topics; a Wikipedia article does. If I have to
choose to define Thing X in terms of Thing Y, it tends to suggest that Y is
more important than X. If Y also defines itself in terms of X, then it
tends to suggest they are equivalent in importance in some way. Indeed I
suspect when we get to the VERY IMPORTANT topics we will see this kind of
circular definition (e.g. you see circular definitions in Wikipedia around
Philosophy and Knowledge). Aside, if you have never done this before, try
this experiment. Choose a random article (left hand tool bar in Desktop
Wikipedia), then click the first link in the article that matters (i.e.
ignore links in hatnotes or links inside parentheses). Repeat this first-link
clicking and sooner or later you will reach articles like Knowledge and
Philosophy, which all sit inside circular definition groups.
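The bootstrapping problem above (important links come from important articles) is exactly what PageRank-style iteration resolves, so here is a minimal power-iteration sketch on a toy link graph. The graph and article names are made up, and dangling pages (no outlinks) are not handled.

```python
# Minimal PageRank power iteration: rank flows from linking pages to
# linked pages, so mutually linking "hub" topics accumulate score.

def pagerank(links, damping=0.85, iters=50):
    """links: {page: [pages it links to]}. Returns page -> score.
    Assumes every page has at least one outlink (no dangling pages)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

# Lower-importance articles link "up"; the two top topics cite each other,
# mirroring the circular-definition effect described above.
links = {
    "Philosophy": ["Knowledge"],
    "Knowledge": ["Philosophy"],
    "Epistemology": ["Knowledge", "Philosophy"],
    "Some_stub": ["Epistemology", "Knowledge"],
}
scores = pagerank(links)
print(max(scores, key=scores.get))  # → Knowledge
```

In this toy graph the mutually linked pair ends up on top, and the stub that nothing links to stays at the bottom, which is the intuition behind using link structure to break the circularity.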
If we look at the Donald Trump article, his first sentence contains only
two links, one to List of Presidents of the USA and the other to President
of the USA. If we look at those two articles, we find that both of them
mention Donald Trump in their lede paras (although not as early as the
first sentence) and before mentions of any other US President elsewhere in
the article. This is consistent with what we know about the real world:
the role of the President is more important than its officeholders, and
the current officeholder has more importance than a past officeholder. So
topic importance does seem to be skewed towards the "present day".
So I suspect the links in the lede paras are of greater relevance to the
assessment of importance than links further down in the article, which are
more likely to relate to details of a topic and may include examples and
counter-examples (this is a way in which a high-importance article may
mention much lower-importance articles). However, we do have to be a little
bit careful here because of the MoS practice of not linking very common
terms. For example, an Australian article will often refer to Australia in
the lede para but it will almost certainly not be linked to the Australia
article (and any attempt to add such a link will likely see it removed with
an edit summary that mentions [[WP:Overlinking]]) whereas there is no
problem if you link to an Australian state article, e.g. New South Wales.
So we might find that some very important topics that often appear in ledes
get fewer links than you might expect because of the MoS policies on
overlinking, which may be a problem when working with inbound links. It may
be that for "very common topics" the presence of the article title (or its
synonyms) in the lede may have to be considered as if it were an in-bound
link for statistical research purposes.
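That workaround could be as simple as a string match: treat an unlinked mention of a common topic's title (or a synonym) in another article's lede as a pseudo-inlink. The lede text and title list below are invented for illustration.

```python
# Toy sketch: count an unlinked lede mention of an "overlinking-exempt"
# topic (e.g. Australia) as if it were an inbound link.

import re

def pseudo_inlink(lede_text, titles):
    """True if any of the given title strings appears as a whole
    word/phrase in the lede text."""
    return any(re.search(r"\b%s\b" % re.escape(t), lede_text)
               for t in titles)

lede = ("The Big Banana is a tourist attraction in Coffs Harbour, "
        "New South Wales, Australia.")
print(pseudo_inlink(lede, ["Australia", "Commonwealth of Australia"]))  # True
```

A real implementation would also need redirects and synonym lists, but even this crude version would recover mentions that the MoS prevents from ever becoming links.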
Given all of the above, perhaps the most interesting group of articles to
study in Wikipedia are those articles whose manually-assessed importance
has changed over the life of the article AND which were NOT current topics
in the lifetime of Wikipedia (given the influence of "current" on
importance). But having said that, I wonder if that group of articles
actually exists. Recently a newish Australian contributor expressed
disappointment that all the new articles they had created were tagged (by
others) as of Low Importance. My instinctive reply was "that's normal, I
think of the thousands of articles I have started only a couple even rated
as Mid importance, this is because the really important articles were all
started long ago precisely because they were important". I suspect topics
that are very important (for reasons other than short-lived importance due
to being "current" in the lifetime of Wikipedia) will generally show up as
having started early in Wikipedia's life, and that those that become more
or less important over time will be largely linked to becoming or ceasing
to be "current" topics. E.g. the article Pasteurization
started in May 2001 saying nothing more than " Pasteurization is the
process of killing off bacteria in milk by quickly heating it to a near
boiling temperature, then quickly cooling it again before the taste and
other desirable properties are affected. The process was named after its
inventor, French scientist Louis Pasteur. See also dairy products." The
links in this very first version are still present in its lede paragraph
today, suggesting our understanding of "non-current" topics is stable and
hence initial importance determinations can probably be accurately made.
For Pasteurization the Talk page shows it was not project-tagged until 2007
when it was assigned High Importance as its first assessment.
I suspect we will find that initial manual assessment of article
importance will be pretty accurate for most articles. And I suspect if we
plot initial importance assessments against time of assessment, we will
find the higher importance articles commenced life on Wikipedia earlier
than the lower importance articles. If I am correct, then there isn't a lot
of value in machine-assessment of importance of topics because it relates
to factors external to Wikipedia and often does not change over time and
therefore can often be correctly assessed manually even on new stub
articles (and any unassessed articles can probably be rated as Low
Importance as statistically that's almost certainly going to be correct).
If a topic becomes more important due to "current" events, then invariably
that article will be updated by many people and one of them will sooner or
later manually adjust its importance. What is less likely to happen is
re-assessing downwards of Importance when an important "current" topic
loses its importance when it is no longer current, e.g. are former American
presidents like Barack Obama or George W Bush or further back less
important now? These articles will not be updated frequently once the topic
is no longer in the news and therefore it is less likely an editor will
notice and manually downgrade the importance, so there may be a greater
role for machine-assessment in downgrading importance rather than upgrading
importance.
Another area where there might be a role for machine-assessed importance
is POV-pushing, where a POV-motivated editor might change the manually
assessed importance of articles to be higher or lower based on their POV
(e.g. my political party is Top Importance, other parties are of Low
Importance). I suspect that often a page watcher would correct or at
least question that kind of re-assessment. However, on articles with few
active pagewatchers one might get away with POV-pushing the article's
importance tag because nobody notices. In this situation, a machine
assessment could be useful in spotting this kind of thing.
This suggests that another metric of interest to importance might be
number of pagewatchers, although I suspect that pagewatching may relate
more to caring about the article than to caring about the topic. And one
has to be careful to distinguish active pagewatchers (those who actually do
review changes on their watchlists) from those who don't, as that may make
a difference (although I am not sure we can really tell which pagewatchers
are truly actively reviewing, as a "satisfactory" review doesn't leave a
trace, whereas an "unsatisfactory" review is likely to lead to a fairly
prompt revert or some other change to the article, the article Talk, or
the User Talk of the reviewed contributor, which may be detectable).
The other aspect of articles that occurs to me as being possibly linked to
importance of the topic would be use of the article as the "main" article
for a category or as the title of a navbox (as it suggests that the
articles in the category or navbox are in some way subordinate to the
main/title article). Similarly for list articles, the "type" of the list is
often more important than its instances.
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org]
On Behalf Of Morten Wang
Sent: Friday, 21 April 2017 6:04 AM
To: Research into Wikimedia content and communities <
wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification
of article importance
Hi Pine,
These are great pointers to existing practices on enwiki, some of which
I've been looking for and/or missed, thanks!
Cheers,
Morten
On 19 April 2017 at 22:35, Pine W <wiki.pine(a)gmail.com> wrote:
Hi Nettrom,
A few resources from English Wikipedia regarding article importance as
ranked by humans:
https://en.wikipedia.org/wiki/Wikipedia:Vital_articles
https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics
I infer from the ENWP Wikicup's scoring protocol that for purposes of
the competition, an article's "importance" is loosely inferred from
the number of language editions of Wikipedia in which the article
appears:
https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.
HTH,
Pine
On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <nettrom(a)gmail.com> wrote:
Hello everyone,
I am currently working with Aaron Halfaker and Dario Taraborelli at
the Wikimedia Foundation on a project exploring automated
classification of article importance. Our goal is to characterize
the importance of an article within a given context and design a
system to predict a relative importance rank. We have a project page
on meta[1] and welcome comments or thoughts on our talk page. You can of
course also respond here on wiki-research-l, or send me an email.
Before moving on to model-building I did a fairly thorough
literature review, finding a myriad of papers spanning several
disciplines. We have a draft literature review also up on meta[2], which
should give you a reasonable introduction to the topic. Again, comments or
thoughts (e.g. papers we’ve missed) on the talk page, mailing list, or
through email are welcome.
Links:
1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance
Regards,
Morten
[[User:Nettrom]] aka [[User:SuggestBot]]
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l