Re: [Wiki-research-l] Project exploring automated classification of article importance

26 Apr 2017

I think you are reading my comments too negatively. I’m not saying to ignore pageviews or
incoming links. I’m saying that a naïve look at their stats may not be as useful as some
of the variations I mention. I think it is worth looking at pageviews relative to those
articles in the same WikiProject. I think it is worth looking at inbound links but to
consider two groups, those coming from the same WikiProject(s) and from other
WikiProjects. I think the position of the incoming links within their source articles is
also significant, either first sentence, first para, whole of lede, or absolute/relative
position of the link in the article (e.g. 2000 bytes from start, or 40% from start).
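
To make the link-position idea concrete, here is a minimal Python sketch that finds the first link to a target in an article's wikitext and reports its absolute byte offset and its relative position. The function name `link_position` and the simplified link pattern are hypothetical, chosen only for illustration; real wikitext (templates, redirects, case normalisation) would need a proper parser.

```python
import re


def link_position(wikitext, target):
    """Return (byte_offset, relative_position) of the first [[target]] link.

    Returns None if the article does not link to `target`.
    Assumption: links look like [[Target]], [[Target|label]] or
    [[Target#Section]]; redirects and templates are ignored.
    """
    pattern = re.compile(
        r"\[\[\s*" + re.escape(target) + r"\s*(?:[|#][^\]]*)?\]\]"
    )
    match = pattern.search(wikitext)
    if match is None:
        return None
    # Measure in bytes (UTF-8), as in "2000 bytes from start".
    offset = len(wikitext[:match.start()].encode("utf-8"))
    total = len(wikitext.encode("utf-8"))
    return offset, offset / total
```

The relative position could then be bucketed (first sentence, first para, lede, rest of article) once the boundaries of those regions are known for the source article.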

The big difference between machine-assessment of article quality and article importance is
that quality is a metric on the article but importance is a metric on the topic. Also, my
informal observation is that article quality does improve and degrade over time and hence
is much more dynamic than topic importance, which seems to me to be much more stable. So I
think there is less scope for dramatically improving the situation by being able to
determine topic importance than the benefits likely to be achieved from automated quality
assessment, but there may be benefit if there are heuristics to spot the relatively few
articles which do need their importance re-assessed due to “current events”. In that case
“editor activity” may be a metric, particularly “editor activity” on the lede para or
other more critical areas of the article.

I am not too worried about 22nd century. I think we should look more at the next decade.
Who would have predicted the demise of Usenet? It seemed pretty sexy at the time, etc.
Wikipedia, like many things, will pass. It’s not to say it will pass into oblivion but it
may morph into something very different to what it is today. Being CC-BY-SA improves the
chances that any successor can build on it, but maybe we should put into WMF’s
constitution, “if WMF shuts down, we release the contents of the projects as CC0” (to
increase the likelihood that the content has a future). Having had to shut down a number
of research institutes when the funding ran out, I know the utter stupidity that occurs when
they retain a skeleton staff to “sell off all our valuable IP”, which every closing-down
institution seems to want to do, and the result is that the IP gets wasted because it
isn’t sold or it’s sold to one of those companies who buy IP for tuppence on the
off-chance they can potentially engage in patent litigation (or other IP litigation)
downstream. We waste so much IP with this kind of “make a buck” thinking. <end of
rant>

Kerry

From: Jane Darnell [mailto:jane023@gmail.com] 
Sent: Wednesday, 26 April 2017 5:51 PM
To: kerry.raymond(a)gmail.com; Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article
importance

Yes I totally agree that "importance is a relative metric rather than absolute."
I also agree that incoming links and pageviews are not accurate measurements of
"importance" for all of the reasons you mention. However, we are still a project
that is actively exploring the universe of knowledge, and leaning heavily on academia and
other established sources we must "boldly go where no man has gone before" (and
please feel free to insert "white, euro-centric" before the man part). So do you
have any suggestions what we could measure going forward that would cough up some
interesting stats to monitor? Pagewatching is useful, but problematic because watchlist
entries are only assigned at page creation, while some marginal editor interest might be
expanded to whole categories (speaking as someone who has thousands of pages watchlisted
on multiple projects). I like your thoughts about looking for key articles such as those
used as the "main" article for a category or as the title of a navbox. I am looking for
similar usages of paintings as a way to find popular painters or
paintings rather than just those paintings which have articles written about them (which
are often written for totally random reasons such as theft/sale/wikiproject).

On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <kerry.raymond(a)gmail.com> wrote:

Just a few musings on the issue of Importance and how to research it ...

I agree it is intuitive that importance is likely to be linked to pageviews and inbound
links but, as the preliminary experiment showed, it's probably not that simple.

Pageviews tell us something about importance to readers of Wikipedia, while inbound links
tell us something about importance to writers of Wikipedia, and I suspect that writers
are not a proxy for readers, as the editor surveys suggest that Wikipedia writers are not
typical of broader society on at least two variables: gender and level of education (there
might be others, I can't remember).

But I think importance is a relative metric rather than absolute. I think that by taking the
mean value of importance across a number of WikiProjects, the preliminary experiment may
have lost something because it tried (through averaging) to look at importance
"generally". I would suspect that an experiment considering only the
importance ratings with respect to a single WikiProject would be more likely to show
correlation with pageviews (relative to other articles in that same WikiProject) and
inbound links. And I think there are two kinds of inbound links to be considered, those
coming from other articles within the same WikiProject and those coming from outside that
WikiProject. I
suspect different insights will be obtained by looking at both types of inbound links
separately rather than treating them as an aggregate. I note also that WikiProjects are
not entirely independent of one another but have relationships between them. For example,
WikiProject Australian Roads describes itself as an "intersection" (ha ha!)
of WikiProject Highways and WikiProject Australia, so I expect that we would find greater
correlation in importance between related WikiProjects than between unrelated
WikiProjects.
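
A minimal Python sketch of the two per-WikiProject measures suggested above: pageviews ranked relative to other articles in the same WikiProject, and inbound links split into same-project and other-project sources. The function names, the input shapes (a dict of pageviews, a list of (source, destination) link pairs, a dict mapping articles to sets of WikiProject names) and the toy data are all assumptions made for illustration, not anything taken from the preliminary experiment.

```python
def percentile_ranks(pageviews_by_article):
    """Map each article to the fraction of project articles it matches or beats.

    `pageviews_by_article` should contain only articles from ONE WikiProject,
    so the rank is relative to that project rather than to all of Wikipedia.
    """
    values = list(pageviews_by_article.values())
    n = len(values)
    return {
        title: sum(v <= views for v in values) / n
        for title, views in pageviews_by_article.items()
    }


def split_inbound_links(target, links, projects_of):
    """Count inbound links to `target` as (same_project, other_project).

    A source counts as "same project" if it shares at least one WikiProject
    with the target.
    """
    target_projects = projects_of.get(target, set())
    same = other = 0
    for source, dest in links:
        if dest != target:
            continue
        if projects_of.get(source, set()) & target_projects:
            same += 1
        else:
            other += 1
    return same, other


# Toy data with made-up article names and numbers.
views = {"A": 100, "B": 10, "C": 1000}
links = [("B", "A"), ("C", "A"), ("A", "C")]
projects_of = {"A": {"Roads"}, "B": {"Roads", "Australia"}, "C": {"Music"}}
ranks = percentile_ranks(views)
counts = split_inbound_links("A", links, projects_of)
```

Keeping the two link counts separate, rather than summing them, is what allows the different insights from each type of inbound link to be examined.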

When thinking about readers and pageviews, I think we have to ask ourselves whether there
is a difference between popularity and importance, or whether popularity *is* importance. I
sense that, as a group of educated people, those of us reading this research mailing list
probably do think there is a difference. Certainly if there is no difference, then this
research can stop now -- just judge importance by pageviews. Let's assume a
difference then. The pageviews of an article are not always consistent
over time. Here are the pageviews for Drottninggatan:

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan

Why so interesting on 8 April? A terrorist attack occurred there. This spike in pageviews
occurs all the time when some topic is in the news (even peripherally as in this case
where it is not the article about the terrorist attack but about the street in which it
occurred). Did the street become more "important"? I think it became more
interesting but not more important. So I think we do have to be careful to understand that
pageviews probably reflect interest rather than importance.  I note that The Chainsmokers
(a music group with a number of songs in the current USA music charts) gets many more
Wikipedia article pageviews than the Wikipedia article on Pasteurization, but The
Chainsmokers are not rated as being of high importance by the relevant WikiProjects while
Pasteurization is very important in WikiProject Food and Drink. Since pasteurisation
prevents a lot of deaths, I think we might agree that in the real world pasteurisation is
more important than a music group regardless of what pageviews tell us.

https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization

Of course it matters for Wikipedia's success that our *popular* articles are of
high quality, but I think we have to be cautious about pageviews being a proxy for
importance.
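
One hedged way to separate a news-driven burst of interest from an article's usual readership is to compare the peak day with the median day over the window. The sketch below is only an illustration of that idea; the function name `spike_ratio` and the choice of the median as a baseline are assumptions, and any threshold applied to the ratio would need to be chosen empirically.

```python
from statistics import median


def spike_ratio(daily_views):
    """Return peak daily views divided by median daily views.

    A ratio near 1 suggests steady readership; a large ratio suggests a
    short-lived spike of interest (e.g. the topic was in the news).
    """
    baseline = median(daily_views)
    peak = max(daily_views)
    if baseline == 0:
        # Avoid division by zero for articles that are rarely read.
        return float("inf") if peak > 0 else 1.0
    return peak / baseline
```

Using the median rather than the mean keeps the baseline from being dragged upwards by the spike itself.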

When we look at Wikipedia writers' decisions in tagging the importance of articles to
WikiProjects, what do we find? As we know, project tags are often placed on new articles
(and often not subsequently reviewed). So while I find that quality tags are often
out-of-date, the importance seems to be pretty accurate even on new stub articles. This
is because it is the importance of the *topic* that is being assessed which is independent
of the Wikipedia article itself. Provided the article is clear enough about what it is
about and why it matters (which is the traditional content of that first paragraph or two
and failing to provide it will likely result in speedy deletion of the new article),
assessment of the topic's importance can be made even at new stub level. This tells us
that importance for Wikipedia writers is determined by something outside of Wikipedia
(probably their real-world knowledge of that topic space -- one assumes that project
taggers are quite interested in the topic space of that project). While article quality
hopefully improves over time, I would be surprised if article importance greatly changed
over time. Obviously there are counter-examples.  I am guessing Donald Trump's article
may have grown in importance over time but that's probably because his lede para
changed. Adding President of the USA into the lede paragraph makes him much more important
than he was before in the real world and internal to Wikipedia he has acquired an inbound
link from the presumably high-importance President of the USA article. So I think it might
be interesting to study those articles whose importance does change over time to see if
there are any strong correlations with what is happening to the article inside Wikipedia.
I think this set of importance-changing articles may be where we really learn what
Wikipedia article characteristics are strongly correlated to "importance", given
that importance itself appears to be pretty stable for most articles.

Although not stated explicitly, I imagine we believe that generally less important
articles tend to link to more important articles but more important articles don't
link to less important articles. Hence in-bound links are likely to matter in
assessing importance, and in-bound links from "important" articles are more
valuable than in-bound links from less important articles (which creates something of a
bootstrapping problem), similar to the issue faced by Google's PageRank algorithm. But I
think we do have some information that Google doesn't have. The average webpage does
not have a lede paragraph that situates the topic relative to other topics; a Wikipedia
article does. If I have to choose to define Thing X in terms of Thing Y, it tends to
suggest that Y is more important than X. If Y also defines itself in terms of X, then it
tends to suggest they are equivalent in importance in some way. Indeed I suspect when we
get to the VERY IMPORTANT topics we will see this kind of circular definition (e.g. you
see circular definitions in Wikipedia around Philosophy and Knowledge). Aside, if you have
never done this before, try this experiment. Choose a random article (left hand tool bar
in Desktop Wikipedia), then click the first link in the article that matters (i.e. ignore
links in hatnotes or links inside parentheses). Repeat this first-link clicking and sooner or
later you will reach articles like Knowledge and Philosophy, which all sit inside circular
definition groups.
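
The first-link experiment can be sketched in a few lines of Python. This is a toy version under stated assumptions: `first_link` is a hypothetical pre-computed mapping from each article to the first "link that matters" in it (extracting that link from real wikitext is the hard part and is not shown), and the article names below are placeholders.

```python
def follow_first_links(start, first_link):
    """Follow first links from `start` until an article repeats or a dead end.

    Returns (path, cycle): `path` is the list of articles visited in order,
    and `cycle` is the circular definition group reached (empty if the walk
    ended at an article with no known first link).
    """
    path = []
    seen = {}  # article -> index in path
    current = start
    while current is not None and current not in seen:
        seen[current] = len(path)
        path.append(current)
        current = first_link.get(current)
    cycle = path[seen[current]:] if current is not None else []
    return path, cycle


# Placeholder graph: A -> B -> C -> B, so B and C define each other.
toy = {"A": "B", "B": "C", "C": "B"}
path, cycle = follow_first_links("A", toy)
```

Counting how many random starting articles end up in each cycle would give one crude indication of which circular definition groups sit at the top of the "defined in terms of" hierarchy.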

If we look at the Donald Trump article, his first sentence contains only two links, one to
List of Presidents of the USA and the other to President of the USA. If we look at
those two articles, we find that both of them mention Donald Trump in their lede paras
(although not as early as the first sentence) and before mentions of any other US
President elsewhere in the article. This is consistent with what we know about the real
world: the role of the President is more important than its officeholders, and the
current officeholder has more importance than a past officeholder. So topic importance
does seem to be skewed towards the "present day".

So I suspect the links in the lede paras are of greater relevance to the assessment of
importance than links further down in the article, which are more likely to relate to
details of a topic and may include examples and counter-examples (this is a way in which
a high-importance article may mention much lower-importance articles). However, we do have
to be a little bit careful here because of the MoS practice of not linking very common
terms. For example, an Australian article will often refer to Australia in the lede para
but it will almost certainly not be linked to the Australia article (and any attempt to
add such a link will likely see it removed with an edit summary that mentions
[[WP:Overlinking]]) whereas there is no problem if you link to an Australian state
article, e.g. New South Wales. So we might find that some very important topics that often
appear in ledes might get fewer links than you might expect because of the MoS policies on
overlinking, which may be a problem when working with inbound links. It may be that for
"very common topics" the presence of the article title (or its synonyms) in the
lede may have to be considered as if it were an in-bound link for statistical research
purposes.
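
As a rough sketch of treating unlinked mentions as if they were in-bound links, the Python below counts linked and unlinked occurrences of a title (or its synonyms) in a lede. The function name `lede_mentions`, the simplified wikilink pattern and the example sentence are assumptions made for illustration only.

```python
import re


def lede_mentions(lede_wikitext, title, synonyms=()):
    """Return (linked, unlinked) counts of `title` or its synonyms in a lede.

    Assumption: links look like [[Title]], [[Title|label]] or
    [[Title#Section]]; templates and redirects are ignored.
    """
    names = [title, *synonyms]
    alternation = "|".join(re.escape(name) for name in names)
    linked = len(re.findall(
        r"\[\[\s*(?:" + alternation + r")\s*(?:[|#][^\]]*)?\]\]",
        lede_wikitext,
    ))
    # Strip every wikilink so that only plain-text mentions remain.
    plain = re.sub(r"\[\[[^\]]*\]\]", " ", lede_wikitext)
    # Word boundaries stop "Australia" from matching inside "Australian".
    unlinked = len(re.findall(r"\b(?:" + alternation + r")\b", plain))
    return linked, unlinked


lede = "Example Town is an Australian town in [[New South Wales]], Australia."
australia = lede_mentions(lede, "Australia")
nsw = lede_mentions(lede, "New South Wales")
```

For statistical purposes the two counts could be added together for "very common topics" while being kept separate for everything else.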

Given all of the above, perhaps the most interesting group of articles to study in
Wikipedia are those articles whose manually-assessed importance has changed over the life
of the article AND which were NOT current topics in the lifetime of Wikipedia (given the
influence of "current" on importance). But having said that, I wonder if that
group of articles actually exists. Recently a newish Australian contributor expressed
disappointment that all the new articles they had created were tagged (by others) as of
Low Importance. My instinctive reply was "that's normal, I think of the thousands
of articles I have started only a couple are even rated as Mid importance, this is because
the really important articles were all started long ago precisely because they were
important". I suspect topics that are very important (for reasons other than
short-lived importance due to being "current" in the lifetime of Wikipedia) will
generally show up as having started early in Wikipedia's life and that those that
become more/less important over time will be largely linked to becoming or ceasing to be
"current" topics. E.g. the article Pasteurization started in May 2001 saying
nothing more than "Pasteurization is the process of killing off bacteria in milk by
quickly heating it to a near boiling temperature, then quickly cooling it again before the
taste and other desirable properties are affected. The process was named after its
inventor, French scientist Louis Pasteur. See also dairy products." The links in this
very first version are still present in its lede paragraph today, suggesting our
understanding of "non-current" topics is stable and hence initial importance
determinations can probably be accurately made. For Pasteurization the Talk page shows it
was not project-tagged until 2007 when it was assigned High Importance as its first
assessment.

I suspect we will find that initial manual assessment of article importance will be pretty
accurate for most articles. And I suspect if we plot initial importance assessments
against time of assessment, we will find the higher importance articles commenced life on
Wikipedia earlier than the lower importance articles. If I am correct, then there
isn't a lot of value in machine-assessment of importance of topics because it relates
to factors external to Wikipedia and often does not change over time and therefore can
often be correctly assessed manually even on new stub articles (and any unassessed
articles can probably be rated as Low Importance as statistically that's almost
certainly going to be correct). If a topic becomes more important due to
"current" events, then invariably that article will be updated by many people
and one of them will sooner or later manually adjust its importance. What is less likely
to happen is re-assessing downwards of Importance when an important "current"
topic loses its importance when it is no longer current, e.g. are former American
presidents like Barack Obama or George W Bush or further back less important now? These
articles will not be updated frequently once the topic is no longer in the news and
therefore it is less likely an editor will notice and manually downgrade the importance,
so there may be a greater role for machine-assessment in downgrading importance rather
than upgrading importance.
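
The suggestion of comparing initial importance assessments against when articles commenced could be checked with something as simple as a median creation year per importance class. The sketch below assumes a hypothetical input of (creation_year, initial_importance) pairs; how those pairs would be extracted from article histories and Talk pages is not shown.

```python
from collections import defaultdict
from statistics import median


def median_creation_year_by_importance(articles):
    """Group (creation_year, importance) pairs and return the median year per class."""
    years = defaultdict(list)
    for creation_year, importance in articles:
        years[importance].append(creation_year)
    return {importance: median(ys) for importance, ys in years.items()}
```

If the hypothesis holds, the medians should increase from Top through High and Mid to Low importance.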

Another area where there might be a role for machine-assessed importance is in regard to
POV-pushing, where a POV-motivated editor might change the manually-assessed importance of
articles to be higher or lower based on their POV (e.g. my political party is Top
Importance, other parties are of Low Importance). I suspect that often a page watcher
would correct or at least question that kind of re-assessment. However, on articles with few
active pagewatchers you might get away with POV-pushing the article's importance tag
because nobody noticed. In this situation, a machine assessment could be useful in
spotting this kind of thing.

This suggests that another metric of interest to importance might be number of
pagewatchers, although I suspect that pagewatching may relate more to caring about the
article than to caring about the topic. And one has to be careful to distinguish active
pagewatchers (those who actually do review changes on their watchlists) from those who
don't, as that may make a difference (although I am not sure we can really tell which
pagewatchers are truly actively reviewing, as a "satisfactory review" doesn't
leave a trace, whereas an "unsatisfactory" review is likely to lead to a
prompt revert or some other change to the article, the article Talk or the User
Talk of the reviewed contributor, which may be detectable).

The other aspect of articles that occurs to me as being possibly linked to importance of
the topic would be use of the article as the "main" article for a category or as
the title of a navbox (as it suggests that the articles in the category or navbox are in
some way subordinate to the main/title article). Similarly for list articles, the
"type" of the list is often more important than its instances.

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Morten Wang
Sent: Friday, 21 April 2017 6:04 AM
To: Research into Wikimedia content and communities <wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification of article
importance

Hi Pine,

These are great pointers to existing practices on enwiki, some of which I've been
looking for and/or missed, thanks!

Cheers,
Morten

On 19 April 2017 at 22:35, Pine W <wiki.pine(a)gmail.com> wrote:

...
  Hi Nettrom,

 A few resources from English Wikipedia regarding article importance as
 ranked by humans:

 https://en.wikipedia.org/wiki/Wikipedia:Vital_articles

 https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic

 https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics

 I infer from the ENWP Wikicup's scoring protocol that for purposes of
 the competition, an article's "importance" is loosely inferred from
 the number of language editions of Wikipedia in which the article appears:
 https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.

 HTH,

 Pine

 On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <nettrom(a)gmail.com> wrote:

  Hello everyone,

 I am currently working with Aaron Halfaker and Dario Taraborelli at
 the Wikimedia Foundation on a project exploring automated
 classification of article importance. Our goal is to characterize
 the importance of an article within a given context and design a
 system to predict a relative importance rank. We have a project page
 on meta[1] and welcome comments or thoughts on our talk page. You can
 of course also respond here on wiki-research-l, or send me an email.

 Before moving on to model-building I did a fairly thorough
 literature review, finding a myriad of papers spanning several
 disciplines. We have a draft literature review also up on meta[2],
 which should give you a reasonable introduction to the topic. Again,
 comments or thoughts (e.g.
 papers we’ve missed) on the talk page, mailing list, or through
 email are welcome.

 Links:

    1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
    2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance

 Regards,
 Morten
 [[User:Nettrom]] aka [[User:SuggestBot]]
 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
