The only thing is that the “real life” problem is the text changing while the citation
stays the same. I don’t see the opposite happen much.
Another thought I had was of course to preserve details of the edit which added the
citation initially: user, timestamp, edit summary, etc.
It would be interesting to find “cliques” (in the loose social sense not the strict
mathematical sense) of users who seem to use the same “clique of citations”. Such groups
might be sockpuppets, meatpuppets etc. Of course, they might just be good faith editors
accessing the same very useful resources for their favourite topic area. But I guess if
you “smell a rat” with one user or one source, then it might be handy to explore any
“cliques” they appear to be operating within to look for suspicious activity of the
others.
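A rough sketch of how one might hunt for such “cliques” (hypothetical Python; it assumes the dataset can be reduced to (user, citation) pairs, and the function name and threshold are made up):

```python
from collections import defaultdict
from itertools import combinations

def shared_citation_pairs(edits, min_jaccard=0.5):
    """edits: (user, citation) pairs harvested from edit histories.
    Return user pairs whose citation sets overlap heavily
    (Jaccard similarity >= min_jaccard) -- candidates for a closer look."""
    cites = defaultdict(set)
    for user, citation in edits:
        cites[user].add(citation)
    pairs = []
    for u, v in combinations(sorted(cites), 2):
        overlap = len(cites[u] & cites[v])
        union = len(cites[u] | cites[v])
        if union and overlap / union >= min_jaccard:
            pairs.append((u, v, overlap / union))
    return pairs
```

High overlap is of course not proof of anything; as above, it may just be good-faith editors mining the same useful resources, so it is only a pointer to where to look.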
I am not quite sure what we might learn from the edit summaries, but I guess if they are
not collected, we will never know if they do contain any interesting patterns.
Another thought that occurs to me is that there is at least one situation where the text
of interest may follow the citation rather than precede it, and that is a list. E.g.
The presidents of the USA are:<ref> one reliable source about all of the
presidents</ref>
* George Washington
* …
* Donald Trump
Also citations within tables pose a bit of a problem in terms of their “span”. Is it just
the cell with the citation? Is it more? I see tables with the last column being used to
hold citations for data that populates that whole row.
Also citations in infoboxes, where there is one field carrying some data followed by a
corresponding citation field, e.g. pop and pop_footnotes (for population in {{Infobox
Australian place}}), raise the same question.
The more I think about this issue, the more I despair. Not so much for this project to
build a citation database, but rather for the fact that without any binding of article
text to the citation, the connection between them is likely to degrade as successive
contributors come along and modify the article, particularly so if they cannot access the
source. I think we have let ourselves be seduced into thinking that so long as we can
*see* a lot of inline citations [1][2][3] in our article, it is well-sourced, but if
we really can’t explain what text is supported by which source, is it really well-sourced?
You might as well just add a bibliography to the end and forget in-line citations. Now one
might argue this is just as true with a traditional journal article (again, no explicit
binding of text to source), but the difference is that a traditional journal article has a
single author or a group of tightly-coupled authors writing it over a relatively short
period of time (weeks rather than years). Those authors are likely to have shared access
to every source being cited and can confer among themselves if needed to sort out any
issue relating to citations, so we can expect the citations to remain close to the text
they support. In Wikipedia, we have a disconnected set of
authors operating over different time frames over an article lifetime of many years who
are unable to share their source materials and so I think the coupling between text and
citation is inevitably likely to be lost because we leave no trace of the coupling for the
next contributor to uphold, even when everyone is acting in good faith. Let’s call it
“cite rot”, which I’ll define as a loss of verifiability due to a disconnect between article
text and source.
It seems to me that we need to make the connection between text and source more explicit.
Think of it from a reader perspective, in most e-readers you can select a word or phrase
and a dictionary lookup is performed to tell you the meaning of the word(s). How about if
in the Wikipedia of 2030 (since we are discussing movement strategy at the moment), the reader
could select some words and have the sources that support them returned. E.g. currently we
might write
Joe Smith was born in London in 1830.[1][2]
Where [1] supports that he was born in London and [2] that he was born in 1830.
In my 2030 Wikipedia, if we clicked on London, cite [1] would highlight (or something) and
if we clicked on 1830, [2] would highlight and if we clicked on born, both would
highlight. That is, the words “Joe Smith was born in London” would be tagged as being [1]
and “Joe Smith was born … in 1830” would be tagged as being [2]. And probably a little
pop-up with the exact quote out of the source document might appear for your verification
pleasure.
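To make the idea concrete, here is a toy sketch (hypothetical Python; the span representation is invented) of the lookup behind that click: each citation is bound to one or more character spans of the article text, and clicking an offset returns every citation whose span contains it.

```python
def cites_for_offset(span_map, offset):
    """span_map: list of (start, end, cite_id) bindings of citations
    to character spans of the article text.
    Return the ids of all citations bound to a span containing
    the clicked character offset."""
    return sorted({cid for start, end, cid in span_map
                   if start <= offset < end})

# "Joe Smith was born in London in 1830."
# [1] supports offsets 0-28 ("Joe Smith was born in London");
# [2] supports offsets 0-18 and 29-36 ("Joe Smith was born ... in 1830").
span_map = [(0, 28, 1), (0, 18, 2), (29, 36, 2)]
```

Clicking inside “London” (say offset 22) returns [1], inside “1830” (offset 33) returns [2], and inside “born” (offset 15) returns [1, 2].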
Now of course we have enough problems with getting our contributors to supply any sources,
let alone binding them to chunks of text as my proposal would entail. But I hear the
Movement Strategy conversation is talking about improved quality and improved
verifiability, so maybe it could be part of the quality assessment: if you want a VGA
(verifiable good article), the text-to-cite mapping must be embedded in the article and
almost all of the text must be “covered” (in the mathematical sense) by the mapping. Indeed,
the extent of coverage could be a verifiability metric.
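The coverage metric itself would be easy to compute from such a mapping. A minimal sketch (hypothetical Python, assuming spans are character offsets into the article text):

```python
def citation_coverage(text_length, spans):
    """spans: (start, end) character ranges that some citation is
    bound to. Return the fraction of the article text covered by
    at least one citation, merging overlapping spans."""
    covered, last_end = 0, 0
    for start, end in sorted(spans):
        start = max(start, last_end)  # skip text already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / text_length if text_length else 0.0
```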
OK, maybe what I am proposing is not the way to go, but I think we ought to be thinking
about this issue of cite rot, because I think it’s a real problem. I suspect it’s already
out there but we don’t notice it because we *see* lots of inline citations and assume all
is well.
Kerry
From: Andrea Forte [mailto:andrea.forte@gmail.com]
Sent: Wednesday, 3 May 2017 11:46 PM
To: kerry.raymond@gmail.com
Cc: Research into Wikimedia content and communities
<wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Citation Project - Comments Welcome!
...and YES, detecting when a reference has changed but the adjacent text has not is
something that will be detectable with the dataset we aim to produce. That's a great
idea!
On Tue, May 2, 2017 at 7:59 AM, Kerry Raymond <kerry.raymond@gmail.com> wrote:
Just a couple of thoughts that cross my mind ...
If people use the {{cite book}} etc templates, it will be relatively easy to work out what
the components of the citation are. However if people roll their own, e.g.
<ref>[http://someurl This And That], Blah Blah 2000</ref>
you may have some difficulty working out what is what. I've just been through a tedious
exercise of updating a set of URLs using AWB over some thousands of articles and some of
the ways people roll their own citations were quite remarkable (and often quite
unhelpful). It may be that you can't extract much from such citations. However, the
good news is that if they have a URL in them, it will probably be in plain-sight.
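For what it's worth, a crude first pass at pulling those plain-sight URLs out of hand-rolled refs might look like this (a Python sketch; the regexes are rough and will certainly miss edge cases in real wikitext):

```python
import re

# <ref ...>body</ref>, skipping self-closing <ref name="x" /> reuses
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.IGNORECASE | re.DOTALL)
URL_RE = re.compile(r"https?://[^\s\]|<>]+")

def urls_in_refs(wikitext):
    """Collect URLs appearing inside <ref>...</ref> bodies,
    whether or not a cite template was used."""
    return [url
            for body in REF_RE.findall(wikitext)
            for url in URL_RE.findall(body)]
```

It would do nothing for templates like {{cite QHR}} that generate their URLs internally; those would presumably need to be expanded, or special-cased, first.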
By contrast, there are a number of templates that I regularly use for citations, like {{cite
QHR}} (currently 1234 transclusions) and {{cite QPN}} (currently 2738 transclusions) and
{{Census 2011 AUS}} (4400 transclusions) all of which generate their URLs. I'm not
sure how you will deal with these in terms of extracting URLs.
But whatever the limitations, it will be a useful dataset to answer some interesting
questions.
One phenomenon I often see is new users updating information (e.g. changing the population
of a town) while leaving behind the old citation for the previous value. So it
superficially looks like the new information is cited to a reliable source when in fact it
isn't. I've often wished we could automatically detect and raise a
"warning" when the "text being supported" by the citation changes yet
the citation does not. The problem, of course, is that we only know where the citation
appears in the text and that we presume it is in support of "some earlier" text
(without being clear exactly where it is). And if an article is reorganised, it may well
result in the citation "drifting away" from the text it supports or even that it
is in support of text that has been deleted. So I think it is important to know what text
preceded the citation at the time the citation first appears in the article history as it
may be useful to compare it against the text that *now* appears before it. It is a great
pity that (in these digital times) we have not developed a citation model where you select
chunks of text and link your citation to them, so that the relationship between the text
and the citation is more apparent.
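If the text preceding the citation were captured at insertion time, the automatic "warning" could be as simple as a similarity check between that snapshot and the current preceding text (a sketch using Python's standard difflib; the 0.6 threshold is an arbitrary guess):

```python
import difflib

def context_drift(original_context, current_context):
    """Similarity in [0, 1] between the text that preceded a citation
    when it was first added and the text that precedes it now."""
    return difflib.SequenceMatcher(None, original_context,
                                   current_context).ratio()

def flag_drift(original_context, current_context, threshold=0.6):
    """True if the supported text appears to have changed out from
    under the citation."""
    return context_drift(original_context, current_context) < threshold
```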
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Andrea Forte
Sent: Tuesday, 2 May 2017 5:18 AM
To: Research into Wikimedia content and communities
<Wiki-research-l@lists.wikimedia.org>
Subject: [Wiki-research-l] Citation Project - Comments Welcome!
Hi all,
One of my PhD students, Meen Chul Kim, is a data scientist with experience in
bibliometrics and we will be working on some citation-related research together with Aaron
and Dario in the coming months. Our main goal in the short term is to develop an enhanced
citation dataset that will allow for future analyses of citation data associated with
article quality, lifecycle, editing trends, etc.
The project page is here:
https://meta.wikimedia.org/wiki/Research:Understanding_the_context_of_citat…
The project is just getting started so this is a great time to offer feedback and
suggestions, especially for features of citations that we should mine as a first step,
since this will affect what the dataset can be used for in the future.
Looking forward to seeing some of you at WikiCite!!
Andrea
--
:: Andrea Forte
:: Associate Professor
:: College of Computing and Informatics, Drexel University
::
http://www.andreaforte.net
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l