I observe (and am unsurprised) that WikiProject Australia also rates the
Pavlova article as High importance, which ties into Stuart's comments
about graphs and subgraphs. If there are relationships between
WikiProjects, there is probably some correlation in the importance of
articles as seen by those projects. As it happens, WikiProject Australia
and WikiProject New Zealand are related on Wikipedia only by both being
within the category "WikiProject Countries projects" (along with every
other national WikiProject), so this is an example where you cannot see the
connection between these projects "on-wiki", but anyone who knows anything
about the geography, history, and culture of the two countries will
understand the close connection (e.g. ANZAC, sheep, pavlova, rugby union).
But, as the project tagging will show, we do have our differences, e.g.
Whitebait is a High Importance article for NZ but Oz doesn't even tag it
(we don't share the NZ passion for these small fish). And perhaps more
seriously, our two countries have different indigenous peoples so our
project tagging around Maori (NZ) and Aboriginal and Torres Strait Islander
(Oz) articles would usually be quite disjoint.
So if there are correlations between project tagging, it may be something
exploitable in machine assessment of importance.
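A minimal sketch of how such a correlation might be measured, assuming each project's tags are available as a simple title-to-importance mapping. The function name, the ordinal scale, and all ratings other than Pavlova and Whitebait are assumptions made for illustration only.

```python
# Importance scale as ordinal values (the ordering is an assumption).
LEVELS = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def tagging_agreement(project_a, project_b):
    """Compare two {article_title: importance} mappings.

    Returns (jaccard, mean_gap): the Jaccard overlap of the tagged article
    sets, and the mean absolute difference in importance level over the
    articles both projects tag (None if they share no articles).
    """
    titles_a, titles_b = set(project_a), set(project_b)
    shared = titles_a & titles_b
    union = titles_a | titles_b
    jaccard = len(shared) / len(union) if union else 0.0
    if not shared:
        return jaccard, None
    gaps = [abs(LEVELS[project_a[t]] - LEVELS[project_b[t]]) for t in shared]
    return jaccard, sum(gaps) / len(gaps)


# Only the Pavlova and Whitebait ratings come from this thread;
# every other entry is invented for the example.
nz = {"Pavlova": "High", "Whitebait": "High", "Rugby union": "Top"}
oz = {"Pavlova": "High", "Rugby union": "High", "Vegemite": "Mid"}

jaccard, mean_gap = tagging_agreement(nz, oz)
print(f"overlap={jaccard:.2f}, mean importance gap={mean_gap:.2f}")
```

A high overlap with a small gap would suggest two closely related projects; a low overlap (as with indigenous-peoples articles) would suggest largely disjoint tagging.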
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org]
On Behalf Of Stuart A. Yeates
Sent: Friday, 28 April 2017 6:18 AM
To: Research into Wikimedia content and communities <
wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Project exploring automated classification
of article importance
On en.wiki article importance is relative to some wikiproject. This is
encoded in
Within Wikiproject New Zealand, there are articles which we think are very
important to us, which we would never argue are even marginally important
on a global scale. Take for example
For the mathematically inclined, this is a classic case of a graph and
many subgraphs.
cheers
stuart
--
...let us be heard from red core to black sky
On 27 April 2017 at 21:44, Gerard Meijssen <gerard.meijssen(a)gmail.com>
wrote:
Hoi,
I have read the proposal and it leaves me wondering. Also the notion
of importance is indeed neither easy nor obvious. I think the question
what is most important is irrelevant depending on how you look at it.
Subject can be irrelevant when you look at it from a personal
perspective, looking at it from a particular perspective and indeed
what seems relevant may become irrelevant or relevant over time. When
you use metrics there will always be one way or another why it will be
found to be problematic.
When you consider Wikipedia, the difference it makes with similar
resources is that its long tail is so much longer and still it is easy
and obvious to show how the English Wikipedia's long tail is not long
enough [1]. When you are looking for links and relevance, Wikidata
includes data on all Wikipedias and thereby more avenues to establish
relevance.
Research has been done that shows that when people are suggested to
write articles or amend articles, it works best when it is about
subjects they care about. What people are interested in was based in
the research on past behaviour. What we could do is flip this and ask
people. Based on categories, on projects, whatever people do to
categorise what is their interest. This will work on a micro level. On
a meta level, it may drive cooperation when we enable people to share
their interest (at that moment in time). On a macro level data may
arrive at Wikidata and this will allow us to seek what articles
include specific data (think date of death for instance). On a meta
and macro level, we could ask readers what subjects they are missing.
This would provide an additional incentive for people to write. For this
last suggestion we could measure what people are missing.
Anyway, relevance and importance depend on a point of view. When our
community is enabled to make a difference, it will help us with our
content. As a movement we know that there is enough that we do not
properly cover. Advocating these issues and targeting and educating
potential communities is where the WMF could play more of a role.
Thanks,
GerardM
[1] http://ultimategerardm.blogspot.nl/2017/04/wikidata-user-stories-sum-of-all.html
On 26 April 2017 at 13:48, Jonathan Cardy <werespielchequers(a)gmail.com>
wrote:
I like to think that in time importance will win out over popularity. If
Wikipedia still exists in fifty or five hundred years' time and we are
still using pasteurisation and indeed still eating hydrocarbon based
foods, then I suspect the pop group you mention will be less frequently
read about than the pasteurisation process.
In the meantime if we try to work it out at all it has to be something of
a judgement call, and one we will occasionally get wrong. Any guesses as
to which current branches of science will be as forgotten in a century as
phrenology is today?
At an extreme the weekly top ten most viewed articles are a good guide to
what is trending in the popular cultures of India and the USA. I'm
assuming that most modern pop culture is inherently ephemeral. Of course
digital historians of future centuries may be rolling on the floor
laughing at this email, and the TV dramas currently being filmed may
still be widely studied and universally known classics while our leading
edge science lies buried in the foundations of their science.
Regards
Jonathan
On 26 Apr 2017, at 08:50, Jane Darnell <jane023(a)gmail.com> wrote:
> Yes I totally agree that "importance is a relative metric rather than
> absolute." I also agree that incoming links and pageviews are not
> accurate measurements of "importance" for all of the reasons you
> mention. However, we are still a project that is actively exploring the
> universe of knowledge, and leaning heavily on academia and other
> established sources we must "boldly go where no man has gone before"
> (and please feel free to insert "white, euro-centric" before the man
> part). So do you have any suggestions what we could measure going
> forward that would cough up some interesting stats to monitor?
> Pagewatching is useful, but problematic because these are only assigned
> at page-creation, while some marginal editor interest might be expanded
> to whole categories (speaking as someone who has thousands of pages
> watchlisted on multiple projects). I like your thoughts about looking
> for key articles such as those used "as the "main" article for a
> category or as the title of a navbox". I am looking for similar usages
> of paintings as a way to find popular painters or paintings rather than
> just those paintings which have articles written about them (which are
> often written for totally random reasons such as
> theft/sale/wikiproject).
>
> On Wed, Apr 26, 2017 at 5:39 AM, Kerry Raymond <kerry.raymond(a)gmail.com>
> wrote:
>
>> Just a few musings on the issue of Importance and how to research it
>> ...
>
>> I agree it is intuitive that importance is likely to be linked to
>> pageviews and inbound links but, as the preliminary experiment showed,
>> it's probably not that simple.
>>
>> Pageviews tell us something about importance to readers of Wikipedia,
>> while inbound links tell us something about importance to writers of
>> Wikipedia, and I suspect that writers are not a proxy for readers as
>> the editor surveys suggest that Wikipedia writers are not typical of
>> broader society on at least two variables: gender and level of
>> education (might be others, I can't remember).
>
>> But I think importance is a relative metric rather than absolute. I
>> think that, by taking the mean value of importance across a number of
>> WikiProjects, the preliminary experiment may have lost something
>> because it tried (through averaging) to look at importance
>> "generally". I would suspect conducting an experiment considering only
>> the importance ratings wrt a single WikiProject would be more likely
>> to show correlation with pageviews (wrt other articles in that same
>> WikiProject) and inbound links. And I think there are two kinds of
>> inbound links to be considered, those coming from other articles
>> within the same WikiProject and those coming from outside that
>> WikiProject. I suspect different insights will be obtained by looking
>> at both types of inbound links separately rather than treating them as
>> an aggregate. I note also that WikiProjects are not entirely
>> independent of one another but have relationships between them. For
>> example, the WikiProject Australian Roads describes itself as an
>> "intersection" (ha ha!) of WikiProject Highways and WikiProject
>> Australia, so I expect that we would find greater correlation in
>> importance between related WikiProjects than between unrelated
>> WikiProjects.
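The two kinds of inbound links described above could be separated along these lines. This is only a sketch: it assumes the link graph is available as (source, target) title pairs and that project membership is a set of titles, and the function name and sample data are invented.

```python
def split_inbound_links(links, project_members):
    """Count inbound links per member article, split by link origin.

    links: iterable of (source_title, target_title) pairs.
    project_members: set of titles tagged by the WikiProject.
    Returns {target: {"internal": n, "external": m}} for member targets.
    """
    counts = {title: {"internal": 0, "external": 0} for title in project_members}
    for source, target in links:
        if target not in project_members:
            continue  # only targets inside the project are of interest
        origin = "internal" if source in project_members else "external"
        counts[target][origin] += 1
    return counts


# Invented link data; not taken from Wikipedia.
members = {"Pacific Highway", "Hume Highway"}
links = [
    ("Hume Highway", "Pacific Highway"),
    ("Sydney", "Pacific Highway"),
    ("Melbourne", "Hume Highway"),
    ("Sydney", "Canberra"),
]
counts = split_inbound_links(links, members)
print(counts)
```

Keeping the two counts apart lets each be correlated with the project's importance ratings on its own, rather than as an aggregate.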
>>
>> When thinking about readers and pageviews, I think we have to ask
>> ourselves is there a difference between popularity and importance. Or
>> whether popularity *is* importance. I sense that, as a group of
>> educated people, those of us reading this research mailing list
>> probably do think there is a difference. Certainly if there is no
>> difference, then this research can stop now -- just judge importance
>> by pageviews. Let's assume a difference then. When looking at
>> pageviews of an article, they are not always consistent over time.
>> Here are the pageviews for Drottninggatan
>>
>> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=Drottninggatan
>
>> Why so interesting on 8 April? A terrorist attack occurred there. This
>> spike in pageviews occurs all the time when some topic is in the news
>> (even peripherally as in this case where it is not the article about
>> the terrorist attack but about the street in which it occurred). Did
>> the street become more "important"? I think it became more interesting
>> but not more important. So I think we do have to be careful to
>> understand that pageviews probably reflect interest rather than
>> importance. I note that The Chainsmokers (a music group with a number
>> of songs in the current USA music charts) gets many more Wikipedia
>> article pageviews than the Wikipedia article on Pasteurization but The
>> Chainsmokers are not rated as being of high importance by the relevant
>> WikiProjects while Pasteurization is very important in WikiProject
>> Food and Drink. Since pasteurisation prevents a lot of deaths, I think
>> we might agree that in the real world pasteurisation is more important
>> than a music group regardless of what pageviews tell us.
>>
>> https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-90&pages=The_Chainsmokers|Pasteurization
>
>> Of course it matters for Wikipedia's success that our *popular*
>> articles are of high quality, but I think we have to be cautious about
>> pageviews being a proxy for importance.
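One way to see why raw pageviews mislead is to compare a spike-sensitive summary (the mean) with a spike-resistant one (the median). The sketch below assumes a plain list of daily view counts; the series, the function name, and the `spike_factor` threshold are all made up for illustration.

```python
from statistics import mean, median


def interest_profile(daily_views, spike_factor=5):
    """Summarise a daily pageview series.

    Returns the median (robust to short spikes), the mean, and the indices
    of days whose views exceed spike_factor times the median.
    spike_factor=5 is an arbitrary illustrative threshold.
    """
    baseline = median(daily_views)
    spikes = [i for i, v in enumerate(daily_views) if v > spike_factor * baseline]
    return {"median": baseline, "mean": mean(daily_views), "spike_days": spikes}


# Made-up series: a quiet article with one news-driven spike.
views = [40, 38, 45, 41, 5000, 900, 60, 42, 39, 44]
profile = interest_profile(views)
print(profile)
```

In this invented series the mean is more than ten times the median, which is the kind of gap a news event such as the Drottninggatan one would be expected to produce.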
>
>> When we look at Wikipedia writers' decisions in tagging the importance
>> of articles to WikiProjects, what do we find? As we know, project tags
>> are often placed on new articles (and often not subsequently
>> reviewed). So while I find that quality tags are often out-of-date,
>> the importance seems to be pretty accurate even on new stub articles.
>> This is because it is the importance of the *topic* that is being
>> assessed, which is independent of the Wikipedia article itself.
>> Provided the article is clear enough about what it is about and why it
>> matters (which is the traditional content of that first paragraph or
>> two and failing to provide it will likely result in speedy deletion of
>> the new article), assessment of the topic's importance can be made
>> even at new stub level. This tells us that importance for Wikipedia
>> writers is determined by something outside of Wikipedia (probably
>> their real-world knowledge of that topic space -- one assumes that
>> project taggers are quite interested in the topic space of that
>> project). While article quality hopefully improves over time, I would
>> be surprised if article importance greatly changed over time.
>> Obviously there are counter-examples. I am guessing Donald Trump's
>> article may have grown in importance over time but that's probably
>> because his lede para changed. Adding President of the USA into the
>> lede paragraph makes him much more important than he was before in the
>> real world and internal to Wikipedia he has acquired an inbound link
>> from the presumably high-importance President of the USA article. So I
>> think it might be interesting to study those articles whose importance
>> does change over time to see if there are any strong correlations with
>> what is happening to the article inside Wikipedia. I think this set of
>> importance-changing articles may be where we really learn what
>> Wikipedia article characteristics are strongly correlated to
>> "importance" given that importance itself appears to be pretty stable
>> for most articles.
>>
>> Although not stated explicitly, I imagine we believe that generally
>> less important articles tend to link to more important articles but
>> more important articles don't link to less important articles. And
>> hence in-bound links are likely to matter in assessing importance and
>> in-bound links from "important" articles are more valuable than
>> in-bound links from less important articles (which creates something
>> of a bootstrapping problem), similar to the issue faced by Google's
>> PageRank algorithm. But I think we do have some information that
>> Google doesn't have. The average webpage does not have a lede
>> paragraph that situates the topic relative to other topics; a
>> Wikipedia article does. If I have to choose to define Thing X in terms
>> of Thing Y, it tends to suggest that Y is more important than X. If Y
>> also defines itself in terms of X, then it tends to suggest they are
>> equivalent in importance in some way. Indeed I suspect when we get to
>> the VERY IMPORTANT topics we will see this kind of circular definition
>> (e.g. you see circular definitions in Wikipedia around Philosophy and
>> Knowledge). Aside, if you have never done this before, try this
>> experiment. Choose a random article (left hand tool bar in Desktop
>> Wikipedia), then click the first link in the article that matters
>> (i.e. ignore links in hatnotes or links inside parentheses). Repeat
>> this first link clicking and sooner or later you will reach articles
>> like Knowledge and Philosophy, which all sit inside circular
>> definition groups.
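The first-link experiment can be simulated offline if the first qualifying link of each article has already been extracted. The sketch below assumes that extraction has been done elsewhere and works on a plain dictionary; the toy graph and the function name are invented and do not reflect real Wikipedia first links.

```python
def first_link_walk(first_link, start):
    """Follow first links from start until an article repeats.

    first_link: {title: title of the first qualifying link in that article}.
    Returns (path, cycle): the articles visited in order, and the loop of
    articles the walk ends in (empty if the walk hits a dead end).
    """
    path, seen = [], {}
    current = start
    while current is not None and current not in seen:
        seen[current] = len(path)  # remember where each article was visited
        path.append(current)
        current = first_link.get(current)
    cycle = path[seen[current]:] if current is not None else []
    return path, cycle


# Toy graph, invented for illustration; not real Wikipedia first links.
toy = {
    "Pavlova": "Dessert",
    "Dessert": "Food",
    "Food": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Knowledge",
}
path, cycle = first_link_walk(toy, "Pavlova")
print(path, cycle)
```

The returned cycle is the "circular definition group" the walk falls into; running the walk from many start articles would show which groups attract the most walks.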
>>
>> If we look at the Donald Trump article, his first sentence contains
>> only two links, one to List of Presidents of the USA and the other to
>> President of the USA. If we look at those two articles, we find that
>> both of them mention Donald Trump in their lede paras (although not as
>> early as the first sentence) and before mentions of any other US
>> President elsewhere in the article. This is consistent with what we
>> know about the real world: the role of the President is more important
>> than its officeholders and the current officeholder has more
>> importance than a past officeholder. So topic importance does seem to
>> be skewed towards the "present day".
>>
>> So I suspect the links in the lede paras are of greater relevance to
>> the assessment of importance than links further down in the article
>> which will be more likely to relate to details of a topic and may
>> include examples and counter-examples (this is a way in which a high
>> importance article may mention much lower importance articles).
>> However, we do have to be a little bit careful here because of the MoS
>> practice of not linking very common terms. For example, an Australian
>> article will often refer to Australia in the lede para but it will
>> almost certainly not be linked to the Australia article (and any
>> attempt to add such a link will likely see it removed with an edit
>> summary that mentions [[WP:Overlinking]]) whereas there is no problem
>> if you link to an Australian state article, e.g. New South Wales. So
>> we might find that some very important topics that often appear in
>> ledes might get fewer links than you might expect because of the MoS
>> policies on overlinking, which may be a problem when working with
>> inbound links. It may be that for "very common topics" the presence of
>> the article title (or its synonyms) in the lede may have to be
>> considered as if it were an in-bound link for statistical research
>> purposes.
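Treating an unlinked lede mention as a pseudo in-bound link might look roughly like this. The sketch assumes lede text is available as plain strings and that real wikilinks are known as a set of source titles; the function name, the synonym list, and the lede texts are invented.

```python
import re


def effective_inbound(topic, synonyms, ledes, linked_from):
    """Count articles that link to topic or merely name it in their lede.

    ledes: {article_title: plain text of the lede paragraph}.
    linked_from: set of article titles that contain a real wikilink to topic.
    An unlinked mention in the lede is counted as if it were an in-bound link.
    """
    names = [topic, *synonyms]
    # Whole-word match, so "Australian" does not count as "Australia".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    mentioned = {
        title for title, text in ledes.items()
        if title != topic and pattern.search(text)
    }
    return len(mentioned | (linked_from - {topic}))


# Invented lede texts and link set, for illustration only.
ledes = {
    "Brisbane": "Brisbane is the capital of Queensland, Australia.",
    "Wagga Wagga": "Wagga Wagga is a city in New South Wales, Australia.",
    "Pavlova": "Pavlova is a meringue-based dessert.",
}
total = effective_inbound("Australia", ["Commonwealth of Australia"],
                          ledes, linked_from={"Canberra"})
print(total)
```

Whether a whole-word match on the title is a good enough stand-in for a link is itself an open question; synonyms and ambiguous titles would need more care.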
>
>> Given all of the above, perhaps the most interesting group of articles
>> to study in Wikipedia are those articles whose manually-assessed
>> importance has changed over the life of the article AND which were NOT
>> current topics in the lifetime of Wikipedia (given the influence of
>> "current" on importance). But having said that, I wonder if that group
>> of articles actually exists. Recently a newish Australian contributor
>> expressed disappointment that all the new articles they had created
>> were tagged (by others) as of Low Importance. My instinctive reply was
>> "that's normal, I think of the thousands of articles I have started
>> only a couple even rated as Mid importance, this is because the really
>> important articles were all started long ago precisely because they
>> were important". I suspect topics that are very important (for reasons
>> other than short-lived importance due to being "current" in the
>> lifetime of Wikipedia) will generally show up as having started early
>> in Wikipedia's life and that those that become more/less important
>> over time will be largely linked to becoming or ceasing to be
>> "current" topics. E.g. article Pasteurization started in May 2001
>> saying nothing more than "Pasteurization is the process of killing off
>> bacteria in milk by quickly heating it to a near boiling temperature,
>> then quickly cooling it again before the taste and other desirable
>> properties are affected. The process was named after its inventor,
>> French scientist Louis Pasteur. See also dairy products." The links in
>> this very first version are still present in its lede paragraph today,
>> suggesting our understanding of "non-current" topics is stable and
>> hence initial importance determinations can probably be accurately
>> made. For Pasteurization the Talk page shows it was not project-tagged
>> until 2007 when it was assigned High Importance as its first
>> assessment.
>
>> I suspect we will find that initial manual assessment of article
>> importance will be pretty accurate for most articles. And I suspect if
>> we plot initial importance assessments against time of assessment, we
>> will find the higher importance articles commenced life on Wikipedia
>> earlier than the lower importance articles. If I am correct, then
>> there isn't a lot of value in machine-assessment of importance of
>> topics because it relates to factors external to Wikipedia and often
>> does not change over time and therefore can often be correctly
>> assessed manually even on new stub articles (and any unassessed
>> articles can probably be rated as Low Importance as statistically
>> that's almost certainly going to be correct). If a topic becomes more
>> important due to "current" events, then invariably that article will
>> be updated by many people and one of them will sooner or later
>> manually adjust its importance. What is less likely to happen is
>> re-assessing downwards of Importance when an important "current" topic
>> loses its importance when it is no longer current, e.g. are former
>> American presidents like Barack Obama or George W Bush or further back
>> less important now? These articles will not be updated frequently once
>> the topic is no longer in the news and therefore it is less likely an
>> editor will notice and manually downgrade the importance, so there may
>> be a greater role for machine-assessment in downgrading importance
>> rather than upgrading importance.
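The suggested comparison of importance against when an article started could begin with something as simple as a median start year per importance class. The sketch assumes records of (title, importance, start year); only the Pasteurization row reflects this thread, and the other rows and the function name are placeholders.

```python
from collections import defaultdict
from statistics import median


def median_start_year(articles):
    """Group (title, importance, start_year) records by importance and
    return {importance: median start year}."""
    years = defaultdict(list)
    for _title, importance, year in articles:
        years[importance].append(year)
    return {level: median(vals) for level, vals in years.items()}


# Only the Pasteurization row reflects this thread (started 2001, rated
# High); the other rows are placeholders.
sample = [
    ("Pasteurization", "High", 2001),
    ("Example article A", "High", 2003),
    ("Example article B", "Low", 2012),
    ("Example article C", "Low", 2015),
    ("Example article D", "Low", 2016),
]
by_level = median_start_year(sample)
print(by_level)
```

If the hypothesis holds on real data, the median start year should rise as the importance class falls.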
>>
>> Another area where there might be a role for machine-assessed
>> importance is in regard to POV-pushing, where a POV-motivated editor
>> might change the manual-assessment importance of articles to be higher
>> or lower based on their POV (e.g. my political party is Top
>> Importance, other parties are of Low Importance). I suspect that often
>> a page watcher would correct or at least question that kind of
>> re-assessment. However, on articles with few active pagewatchers you
>> might get away with POV-pushing the article's importance tag because
>> nobody noticed. In this situation, a machine assessment could be
>> useful in spotting this kind of thing.
>>
>> This suggests that another metric of interest to importance might be
>> number of pagewatchers, although I suspect that pagewatching may
>> relate more to caring about the article than to caring about the
>> topic. And one has to be careful to distinguish active pagewatchers
>> (those who actually do review changes on their watchlists) from those
>> who don't, as that may make a difference (although I am not sure we
>> can really tell which pagewatchers are truly actively reviewing as a
>> "satisfactory review" doesn't leave a trace whereas an
>> "unsatisfactory" review is likely to lead to a relatively soon revert
>> or some other change to the article, the article Talk or the User Talk
>> of reviewed contributor which may be detectable).
>>
>> The other aspect of articles that occurs to me as being possibly
>> linked to importance of the topic would be use of the article as the
>> "main" article for a category or as the title of a navbox (as it
>> suggests that the articles in the category or navbox are in some way
>> subordinate to the main/title article). Similarly for list articles,
>> the "type" of the list is often more important than its instances.
>
> Kerry
>
>> -----Original Message-----
>> From: Wiki-research-l [mailto:wiki-research-l-bounces(a)lists.wikimedia.org]
>> On Behalf Of Morten Wang
>> Sent: Friday, 21 April 2017 6:04 AM
>> To: Research into Wikimedia content and communities <wiki-research-l(a)lists.wikimedia.org>
>> Subject: Re: [Wiki-research-l] Project exploring automated classification
>> of article importance
>>
>> Hi Pine,
>>
>> These are great pointers to existing practices on enwiki, some of
>> which I've been looking for and/or missed, thanks!
>>
>>
>> Cheers,
>> Morten
>>
>>> On 19 April 2017 at 22:35, Pine W <wiki.pine(a)gmail.com> wrote:
>>>
>>> Hi Nettrom,
>>>
>>> A few resources from English Wikipedia regarding article
>>> importance
as
assessment#Statist
>>> ics
>>>
>>> I infer from the ENWP Wikicup's scoring protocol that for purposes of
>>> the competition, an article's "importance" is loosely inferred from
>>> the number of language editions of Wikipedia in which the article
>>> appears:
>>> https://en.wikipedia.org/wiki/Wikipedia:WikiCup/Scoring#Bonus_points
>>
>> HTH,
>>
>> Pine
>>
>>
>>> On Tue, Apr 18, 2017 at 4:17 PM, Morten Wang <nettrom(a)gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> I am currently working with Aaron Halfaker and Dario Taraborelli at
>>> the Wikimedia Foundation on a project exploring automated
>>> classification of article importance. Our goal is to characterize the
>>> importance of an article within a given context and design a system
>>> to predict a relative importance rank. We have a project page on
>>> meta[1] and welcome comments or thoughts on our talk page. You can of
>>> course also respond here on wiki-research-l, or send me an email.
>>>
>>> Before moving on to model-building I did a fairly thorough literature
>>> review, finding a myriad of papers spanning several disciplines. We
>>> have a draft literature review also up on meta[2], which should give
>>> you a reasonable introduction to the topic. Again, comments or
>>> thoughts (e.g. papers we’ve missed) on the talk page, mailing list,
>>> or through email are welcome.
>>>
>>> Links:
>>>
>>> 1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
>>> 2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance
>>>
>>> Regards,
>>> Morten
>>> [[User:Nettrom]] aka [[User:SuggestBot]]
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l(a)lists.wikimedia.org
>>>
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l