Sorry if I seemed negative! I am just responding to your comments in the
same way I have been trying to decide how to measure stuff to enable my
wikiprojects to move forward. This is very frustrating stuff! I also agree
that editor activity is probably a very good way to measure all sorts of
things, and it just seems sad that any attempts in this area seem to come
up hard against a wall of "privacy issues". Privacy is also linked to
ownership, and as it stands now, Wikipedia editors still own their own
words & media, which means we can't let go of the cc-by-sa licensing yet. I
do agree however that we should move to a model of "cc0 by default" rather
than "cc-by-sa" by default. Most people don't care and if you explain the
difference they are surprised that there is an option that is more open
than the one they thought they were using. We can't retroactively turn
cc-by-sa into cc0 without the consent of the original
uploaders/writers, but we can try to get documents and data released cc0 in
Wikisource and more cc0 material uploaded to Commons!
On Wed, Apr 26, 2017 at 2:32 PM, Kerry Raymond <kerry.raymond(a)gmail.com>
wrote:
I think you are reading my comments too negatively. I’m not saying to
ignore pageviews or incoming links. I’m saying that a naïve look at their
stats may not be as useful as some of the variations I mention. I think it
is worth looking at pageviews relative to those articles in the same
WikiProject. I think it is worth looking at inbound links but to consider
two groups, those coming from the same WikiProject(s) and from other
WikiProjects. I think the position of the incoming links within their
source articles is also significant, either first sentence, first para,
whole of lede, or absolute/relative position of the link in the article
(e.g. 2000 bytes from start, or 40% from start).
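The absolute/relative link position mentioned above could be measured along these lines. This is only a rough sketch in Python: the function name `link_position` is made up for illustration, it assumes the caller has already fetched the raw wikitext of the source article, and it ignores redirects, templates and the case-insensitive first letter of titles.

```python
import re
from typing import Optional, Tuple


def link_position(wikitext: str, target: str) -> Optional[Tuple[int, float]]:
    """Locate the first [[target]] or [[target|label]] link in the wikitext.

    Returns (bytes_from_start, fraction_from_start), or None if the
    article does not link to the target at all.
    """
    pattern = r"\[\[" + re.escape(target) + r"(?:\||\]\])"
    match = re.search(pattern, wikitext)
    if match is None:
        return None
    # Measure in UTF-8 bytes so that "2000 bytes from start" is meaningful.
    offset = len(wikitext[:match.start()].encode("utf-8"))
    total = len(wikitext.encode("utf-8"))
    return offset, offset / total


# Hypothetical article text: 40 bytes of filler, one link, then more filler.
sample = "x" * 40 + "[[Australia]]" + "y" * 47
print(link_position(sample, "Australia"))
```

Whether the first sentence, first para or whole lede contains the link would need a lede boundary as well, which this sketch does not attempt.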
The big difference between machine-assessment of article quality and
article importance is that quality is a metric on the article but
importance is a metric on the topic. Also, my informal observation is that
article quality does improve and degrade over time and hence is much more
dynamic than topic importance, which seems to me to be much more stable. So
I think there is less scope for dramatically improving the situation by
being able to determine topic importance than the benefits likely to be
achieved from automated quality assessment, but there may be benefit if
there are heuristics to spot the relatively few articles which do need
importance re-assessed due to “current events”. In which case “editor
activity” may be a metric, particularly “editor activity” on the lede para
or other more critical areas of the article.
I am not too worried about 22nd century. I think we should look more at
the next decade. Who would have predicted the demise of Usenet? It seemed
pretty sexy at the time, etc. Wikipedia, like many things, will pass. It’s
not to say it will pass into oblivion but it may morph into something very
different to what it is today. Being CC-BY-SA improves the chances that any
successor can build on it, but maybe we should put into WMF’s constitution,
“if WMF shuts down, we release the contents of the projects as CC0” (to
increase the likelihood that the content has a future). Having had to shut
down a number of research institutes when the funding ran out, I know the
utter stupidity that occurs when they retain a skeleton staff to “sell off
all our valuable IP”, which every closing-down institution seems to want to
do, and the result is that the IP gets wasted because it isn’t sold or it’s
sold to one of those companies who buy IP for tuppence on the off-chance
they can potentially engage in patent litigation (or other IP litigation)
downstream. We waste so much IP with this kind of “make a buck” thinking.
<end of rant>
Kerry
*From:* Jane Darnell [mailto:jane023@gmail.com]
*Sent:* Wednesday, 26 April 2017 5:51 PM
*To:* kerry.raymond(a)gmail.com; Research into Wikimedia content and
communities <wiki-research-l(a)lists.wikimedia.org>
*Subject:* Re: [Wiki-research-l] Project exploring automated
classification of article importance
Yes I totally agree that "importance is a relative metric rather than
absolute." I also agree that incoming links and pageviews are not accurate
measurements of "importance" for all of the reasons you mention. However,
we are still a project that is actively exploring the universe of
knowledge, and leaning heavily on academia and other established sources we
must "boldly go where no man has gone before" (and please feel free to
insert "white, euro-centric" before the man part). So do you have any
suggestions what we could measure going forward that would cough up some
interesting stats to monitor? Pagewatching is useful, but problematic
because these are only assigned at page-creation, while some marginal
editor interest might be expanded to whole categories (speaking as someone
who has thousands of pages watchlisted on multiple projects). I like your
thoughts about looking for key articles such as those used as the "main"
article for a category or as the title of a navbox. I am
looking for similar usages of paintings as a way to find popular painters
or paintings rather than just those paintings which have articles written
about them (which are often written for totally random reasons such as
theft/sale/wikiproject).
On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <kerry.raymond(a)gmail.com>
wrote:
Just a few musings on the issue of Importance and how to research it ...
I agree it is intuitive that importance is likely to be linked to
pageviews and inbound links but, as the preliminary experiment showed, it's
probably not that simple.
Pageviews tells us something about importance to readers of Wikipedia,
while inbound links tells us something about importance to writers of
Wikipedia, and I suspect that writers are not a proxy for readers as the
editor surveys suggest that Wikipedia writers are not typical of broader
society on at least two variables: gender and level of education (might be
others, I can't remember).
But I think importance is a relative metric rather than absolute. I think
taking the mean value of importance across a number of WikiProjects in
the preliminary experiment may have lost something because it tried
(through averaging) to look at importance "generally". I would suspect
conducting an experiment considering only the importance ratings within a
single WikiProject would be more likely to show correlation with pageviews
(relative to other articles in that same WikiProject) and inbound links. And I
think there are two kinds of inbound links to be considered, those coming
from other articles within the same WikiProject and those coming from
outside that WikiProject. I suspect different insights will be obtained by
looking at both types of inbound links separately rather than treating them
as an aggregate. I note also that WikiProjects are not entirely independent
of one another but have relationships between them. For example,
WikiProject Australian Roads describes itself as an "intersection" (ha ha!)
of WikiProject Highways and WikiProject Australia, so I expect that we
would find greater correlation in importance between related WikiProjects
than between unrelated WikiProjects.
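To make the two kinds of inbound links concrete, here is a small Python sketch. The function name, the article names and the project tagging data are all invented for illustration, and it assumes each article's WikiProject tags have already been collected into a dictionary.

```python
from typing import Dict, Iterable, List, Set, Tuple


def split_inbound_links(
    article: str,
    inbound_sources: Iterable[str],
    projects_of: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Split inbound link sources into same-project and other-project groups.

    A source counts as "same project" if it shares at least one
    WikiProject tag with the target article.
    """
    home_projects = projects_of.get(article, set())
    same_project: List[str] = []
    other_project: List[str] = []
    for source in inbound_sources:
        if projects_of.get(source, set()) & home_projects:
            same_project.append(source)
        else:
            other_project.append(source)
    return same_project, other_project


# Made-up tagging data, for illustration only.
projects_of = {
    "Road A": {"Australian Roads", "Australia"},
    "Road B": {"Australian Roads"},
    "Town C": {"Australia"},
    "Band D": {"Music"},
}
same, other = split_inbound_links(
    "Road A", ["Road B", "Town C", "Band D"], projects_of
)
print(same, other)
```

The two counts could then be correlated with the importance rating separately, one WikiProject at a time.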
When thinking about readers and pageviews, I think we have to ask
ourselves whether there is a difference between popularity and importance, or
whether popularity *is* importance. I sense that, as a group of educated
people, those of us reading this research mailing list probably do think
there is a difference. Certainly if there is no difference, then this
research can stop now -- just judge importance by pageviews. Let's assume
a difference then. When looking at pageviews of an article, they are not
always consistent over time. Here are the pageviews for Drottninggatan
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
Why so interesting on 8 April? A terrorist attack occurred there. This
spike in pageviews occurs all the time when some topic is in the news (even
peripherally as in this case where it is not the article about the
terrorist attack but about the street in which it occurred). Did the street
become more "important"? I think it became more interesting but not more
important. So I think we do have to be careful to understand that pageviews
probably reflect interest rather than importance. I note that The
Chainsmokers (a music group with a number of songs in the current USA music
charts) gets many more Wikipedia article pageviews than the Wikipedia
article on Pasteurization but The Chainsmokers are not rated as being of
high importance by the relevant WikiProjects while Pasteurization is very
important in WikiProject Food and Drink. Since pasteurisation prevents a
lot of deaths, I think we might agree that in the real world pasteurisation
is more important than a music group regardless of what pageviews tell us.
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
Of course it matters for Wikipedia's success that our *popular*
articles are of high quality, but I think we have to be cautious about
pageviews being a proxy for importance.
When we look at Wikipedia writers' decisions in tagging the importance of
articles to WikiProjects, what do we find? As we know, project tags are
often placed on new articles (and often not subsequently reviewed). So
while I find that quality tags are often out-of-date, the importance seems
to be pretty accurate even on new stub articles. This is because it is
the importance of the *topic* that is being assessed which is independent
of the Wikipedia article itself. Provided the article is clear enough about
what it is about and why it matters (which is the traditional content of
that first paragraph or two and failing to provide it will likely result in
speedy deletion of the new article), assessment of the topic's importance
can be made even at new stub level. This tells us that importance for
Wikipedia writers is determined by something outside of Wikipedia (probably
their real-world knowledge of that topic space -- one assumes that project
taggers are quite interested in the topic space of that project). While
article quality hopefully improves over time, I would be surprised if
article importance greatly changed over time. Obviously there are
counter-examples. I am guessing Donald Trump's article may have grown in
importance over time but that's probably because his lede para changed.
Adding President of the USA into the lede paragraph makes him much more
important than he was before in the real world and internal to Wikipedia he
has acquired an inbound link from the presumably high-importance President
of the USA article. So I think it might be interesting to study those
articles whose importance does change over time to see if there are any
strong correlations with what is happening to the article inside Wikipedia.
I think this set of importance-changing articles may be where we
really learn what Wikipedia article characteristics are strongly correlated
to "importance" given that importance itself appears to be pretty stable
for most articles.
Although not stated explicitly, I imagine we believe that generally less
important articles tend to link to more important articles but more
important articles don't link to less important articles. And hence
in-bound links are likely to matter in assessing importance and that
in-bound links from "important" articles are more valuable than in-bound
links from less important articles (which creates something of a
bootstrapping problem), similar to the issue with Google's PageRank
algorithm. But I think we do have some information that Google doesn't
have. The average webpage does not have a lede paragraph that situates the
topic relative to other topics; a Wikipedia article does. If I have to
choose to define Thing X in terms of Thing Y, it tends to suggest that Y is
more important than X. If Y also defines itself in terms of X, then it
tends to suggest they are equivalent in importance in some way. Indeed I
suspect when we get to the VERY IMPORTANT topics we will see this kind of
circular definition (e.g. you see circular definitions in Wikipedia around
Philosophy and Knowledge). As an aside, if you have never done this before, try
this experiment. Choose a random article (left hand tool bar in Desktop
Wikipedia), then click the first link in the article that matters (i.e.
ignore links in hatnotes or links inside parentheses). Repeat this first link
clicking and sooner or later you will reach articles like Knowledge and
Philosophy, which all sit inside circular definition groups.
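The first-link experiment can be sketched in a few lines of Python. This assumes the "first link that matters" of every article has already been extracted into a dictionary; the mapping below is invented for illustration and does not reflect the real first links of any Wikipedia article.

```python
from typing import Dict, List, Tuple


def first_link_chain(
    start: str, first_link: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Follow first links from `start` until an article repeats.

    Returns (path, cycle): the articles visited in order, and the
    circular definition group the walk ends in (empty if the walk
    reaches an article with no recorded first link).
    """
    position: Dict[str, int] = {}
    path: List[str] = []
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = first_link.get(node)
    cycle = path[position[node]:] if node is not None else []
    return path, cycle


# Hypothetical first-link data, not taken from Wikipedia.
first_link = {
    "Street": "City",
    "City": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Knowledge",
}
path, cycle = first_link_chain("Street", first_link)
print(path)
print(cycle)
```

Articles that turn up in the final cycle for many different starting points would be the candidates for the circular definition groups described above.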
If we look at the Donald Trump article, his first sentence contains only
two links, one to List of Presidents of the USA and the other to President
of the USA. If we look at those two articles, we find that both of them
mention Donald Trump in their lede paras (although not as early as the
first sentence) and before mentions of any other US President elsewhere in
the article. Which is consistent with what we know about the real world,
the role of the President is more important than its officeholders and that
the current officeholder has more importance than a past officeholder. So
topic importance does seem to be skewed towards the "present day".
So I suspect the links in the lede paras are of greater relevance to the
assessment of importance than links further down in the article which will
be more likely relate to details of a topic and may include examples and
counter-examples (this is a way in which a high-importance article may
mention much lower-importance articles). However, we do have to be a little
bit careful here because of the MoS practice of not linking very common
terms. For example, an Australian article will often refer to Australia in
the lede para but it will almost certainly not be linked to the Australia
article (and any attempt to add such a link will likely see it removed with
an edit summary that mentions [[WP:Overlinking]]) whereas there is no
problem if you link to an Australian state article, e.g. New South Wales.
So we might find that some very important topics that often appear in ledes
might get fewer links than you might expect because of the MoS policies on
overlinking, which may be a problem when working with inbound links. It may
be that for "very common topics" the presence of the article title (or its
synonyms) in the lede may have to be considered as if it were an in-bound
link for statistical research purposes.
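One way to treat an unlinked lede mention as if it were an inbound link is sketched below in Python. The function name and the sample lede are hypothetical, the synonym list has to be supplied by the researcher, and plain word matching like this will also pick up false positives (e.g. "Australia" inside "Western Australia").

```python
import re
from typing import Iterable, Optional


def lede_reference(
    lede_wikitext: str, title: str, synonyms: Iterable[str] = ()
) -> Optional[str]:
    """Classify how a lede refers to a topic.

    Returns "linked" if the lede contains a wikilink to the title or a
    synonym, "mentioned" if the name only appears as plain text, and
    None if it does not appear at all.
    """
    names = [title, *synonyms]
    for name in names:
        link = r"\[\[\s*" + re.escape(name) + r"\s*(?:\||\]\])"
        if re.search(link, lede_wikitext):
            return "linked"
    for name in names:
        if re.search(r"\b" + re.escape(name) + r"\b", lede_wikitext):
            return "mentioned"
    return None


# Hypothetical lede in the style described above: state linked, country not.
lede = "'''Exampleville''' is a town in [[New South Wales]], Australia."
print(lede_reference(lede, "New South Wales"))
print(lede_reference(lede, "Australia"))
```

For statistical purposes, both "linked" and "mentioned" could then be counted as inbound references for the very common topics, while keeping them distinguishable.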
Given all of the above, perhaps the most interesting group of articles to
study in Wikipedia are those articles whose manually-assessed importance
has changed over the life of the article AND which were NOT current topics
in the lifetime of Wikipedia (given the influence of "current" on
importance). But having said that, I wonder if that group of articles
actually exists. Recently a newish Australian contributor expressed
disappointment that all the new articles they had created were tagged (by
others) as of Low Importance. My instinctive reply was "that's normal, I
think of the thousands of articles I have started only a couple are even rated
as Mid importance, this is because the really important articles were all
started long ago precisely because they were important". I suspect topics
that are very important (for reasons other than short-lived
importance due to being "current" in the lifetime of Wikipedia) will
generally show up as having started early in Wikipedia's life and that
those that become more/less important over time will be largely linked to
becoming or ceasing to be "current" topics. E.g. the article Pasteurization
started in May 2001 saying nothing more than "Pasteurization is the
process of killing off bacteria in milk by quickly heating it to a near
boiling temperature, then quickly cooling it again before the taste and
other desirable properties are affected. The process was named after its
inventor, French scientist Louis Pasteur. See also dairy products." The
links in this very first version are still present in its lede paragraph
today, suggesting our understanding of "non-current" topics is stable and
hence initial importance determinations can probably be accurately made.
For Pasteurization the Talk page shows it was not project-tagged until 2007
when it was assigned High Importance as its first assessment.
I suspect we will find that initial manual assessment of article
importance will be pretty accurate for most articles. And I suspect if we
plot initial importance assessments against time of assessment, we will
find the higher importance articles commenced life on Wikipedia earlier
than the lower importance articles. If I am correct, then there isn't a lot
of value in machine-assessment of importance of topics because it relates
to factors external to Wikipedia and often does not change over time and
therefore can often be correctly assessed manually even on new stub
articles (and any unassessed articles can probably be rated as Low
Importance as statistically that's almost certainly going to be correct).
If a topic becomes more important due to "current" events, then invariably
that article will be updated by many people and one of them will sooner or
later manually adjust its importance. What is less likely to happen is
re-assessing downwards of Importance when an important "current" topic
loses its importance when it is no longer current, e.g. are former American
presidents like Barack Obama or George W Bush or further back less
important now? These articles will not be updated frequently once the topic
is no longer in the news and therefore it is less likely an editor will
notice and manually downgrade the importance, so there may be a greater
role for machine-assessment in downgrading importance rather than upgrading
importance.
Another area where there might be a role for machine-assessed importance
is in regard to POV-pushing, where a POV-motivated editor might change the
manually-assessed importance of articles to be higher or lower based on
their POV (e.g. my political party is Top Importance, other parties are of
Low Importance). I suspect that often a page watcher would correct or at
least question that kind of re-assessment. However, on articles with few
active pagewatchers you might get away with POV-pushing the article's
importance tag because nobody notices. In this situation, a machine
assessment could be useful in spotting this kind of thing.
This suggests that another metric of interest to importance might be
number of pagewatchers, although I suspect that pagewatching may relate
more to caring about the article than to caring about the topic. And one
has to be careful to distinguish active pagewatchers (those who actually do
review changes on their watchlists) from those who don't, as that may make
a difference (although I am not sure we can really tell which pagewatchers
are truly actively reviewing as a "satisfactory review" doesn't leave a
trace whereas an "unsatisfactory" review is likely to lead to a prompt
revert or some other change to the article, the article Talk or the
User Talk of the reviewed contributor, which may be detectable).
The other aspect of articles that occurs to me as being possibly linked to
importance of the topic would be use of the article as the "main" article
for a category or as the title of a navbox (as it suggests that the
articles in the category or navbox are in some way subordinate to the
main/title article). Similarly for list articles, the "type" of the list is
often more important than its instances.
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org]
On Behalf Of Morten Wang
Sent: Friday, 21 April 2017 6:04 AM
To: Research into Wikimedia content and communities <
wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification
of article importance
Hi Pine,
These are great pointers to existing practices on enwiki, some of which
I've been looking for and/or missed, thanks!
Cheers,
Morten
On 19 April 2017 at 22:35, Pine W <wiki.pine(a)gmail.com> wrote:
Hi Nettrom,
A few resources from English Wikipedia regarding article importance as
ranked by humans:
https://en.wikipedia.org/wiki/Wikipedia:Vital_articles
https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_Criteria#Priority_of_topic
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Statistics
I infer from the ENWP Wikicup's scoring protocol that for purposes of
the competition, an article's "importance" is loosely inferred from
the number of language editions of Wikipedia in which the article
appears:
https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points.
HTH,
Pine
On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <nettrom(a)gmail.com> wrote:
Hello everyone,
I am currently working with Aaron Halfaker and Dario Taraborelli at
the Wikimedia Foundation on a project exploring automated
classification of article importance. Our goal is to characterize
the importance of an article within a given context and design a
system to predict a relative importance rank. We have a project page
on meta[1] and welcome comments or thoughts on our talk page. You can of
course also respond here on wiki-research-l, or send me an email.
Before moving on to model-building I did a fairly thorough
literature review, finding a myriad of papers spanning several
disciplines. We have a draft literature review also up on meta[2], which
should give you a
reasonable introduction to the topic. Again, comments or thoughts (e.g.
papers we’ve missed) on the talk page, mailing list, or through
email are welcome.
Links:
1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance
Regards,
Morten
[[User:Nettrom]] aka [[User:SuggestBot]]
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l