Hello Kerry,
A lot of good points. It is about how to work collaboratively using
footnotes. That does not really happen right now, and it is difficult to
verify something without having the book at hand - or even then.
We would need much stricter rules on how to show exactly where each piece
of information comes from, and where to look for it. But in reality, I
write e.g. one paragraph in Wikipedia, summing up 2-3 pages of a book, and
then put a footnote referring to these 2-3 pages at the end of the
paragraph. Otherwise my process of writing would be much slower, and
possibly my summary would be worse.
This is indeed a "real world problem" (i.e. difficult for Wikipedia to
solve), but a special one within a text with a large number of (partially
anonymous) contributors. It would help if a mouse-over on existing text
could at least indicate which Wikipedia account was responsible for exactly
that text.
Kind regards,
Ziko
PS:
In your example, I'd say:
Joe Smith was born in London[1] in 1830[2].
Where [1] supports that he was born in London and [2] that he was born in
1830.
2017-05-09 17:26 GMT+02:00 Andrea Forte <andrea.forte(a)gmail.com>:
(meant to reply all!)
Kerry - right, re: real world problem, I meant that the other way around:
we will have both the reference text and the preceding text, so if either
one changes, it'll be detectable. That won't do everything you are
describing, but it will provide the data required to think through a lot of
these problems and come up with future approaches to understanding the link
between a reference and the text in which it is embedded.
You raise a really good point that I had totally missed - in past work I
captured revision-related data like username/timestamp/etc. I do want to
capture revisionid and revision-related metadata along with the reference
text itself. I've added a note about that to the proposed data structure,
thank you!
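As a sketch, one shape such a per-reference record could take - the field names here are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ReferenceRecord:
    # The citation markup as it appeared in the article.
    reference_text: str
    # Text immediately preceding the citation, kept for later drift comparison.
    preceding_text: str
    # Metadata for the revision that introduced the reference.
    revision_id: int
    username: str
    timestamp: str  # ISO 8601
    edit_summary: str

# Hypothetical example record:
rec = ReferenceRecord(
    reference_text="<ref>{{cite book|title=Example|year=2000}}</ref>",
    preceding_text="Joe Smith was born in London in 1830.",
    revision_id=123456,
    username="ExampleUser",
    timestamp="2017-05-03T11:07:00Z",
    edit_summary="add source",
)
```

Keeping the revision metadata alongside the reference text is what makes it possible to ask later whether the surrounding text still matches what was there when the citation was added.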
On Wed, May 3, 2017 at 11:07 AM, Kerry Raymond <kerry.raymond(a)gmail.com>
wrote:
The only thing is that the “real life” problem is the text changing but
the citation staying the same. I don’t see the opposite happen much.
Another thought I had was, of course, to preserve details of the edit which
added the citation initially: user, timestamp, edit summary, etc.
It would be interesting to find “cliques” (in the loose social sense, not
the strict mathematical sense) of users who seem to use the same “clique of
citations”. Such groups might be sockpuppets, meatpuppets etc. Of course,
they might just be good faith editors accessing the same very useful
resources for their favourite topic area. But I guess if you “smell a rat”
with one user or one source, then it might be handy to explore any
“cliques” they appear to be operating within to look for suspicious
activity of the others.
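One crude way to surface such overlaps - a sketch with made-up usernames and sources, not a sockpuppet detector - is pairwise Jaccard similarity over each user's set of cited sources:

```python
from itertools import combinations

# Hypothetical data: which sources each user has cited.
citations_by_user = {
    "UserA": {"source1", "source2", "source3"},
    "UserB": {"source1", "source2", "source3"},
    "UserC": {"source9"},
}

def jaccard(a, b):
    """Overlap between two citation sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# Flag user pairs whose citation sets overlap heavily.
suspicious_pairs = [
    (u, v)
    for u, v in combinations(sorted(citations_by_user), 2)
    if jaccard(citations_by_user[u], citations_by_user[v]) > 0.8
]
print(suspicious_pairs)  # [('UserA', 'UserB')]
```

As the email notes, a high overlap is only a prompt for human inspection: good-faith editors in a niche topic area will score just as highly as sockpuppets.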
I am not quite sure what we might learn from the edit summaries, but I
guess if they are not collected, we will never know if they do contain any
interesting patterns.
Another thought that occurs to me is that there is at least one situation
where some of the text of interest may follow the citation rather than
precede it, and that is a list. E.g.
The presidents of the USA are:<ref> one reliable source about all of the
presidents</ref>
· George Washington
· …
· Donald Trump
Also, citations within tables pose a bit of a problem in terms of their
“span”. Is it just the cell with the citation? Is it more? I see tables
with the last column being used to hold citations for data that populates
that whole row.
Also, citations in infoboxes, where there is one field carrying some data
followed by a corresponding citation field, e.g. pop and pop_footnotes
(for population in infobox Australian place).
The more I think about this issue, the more I despair. Not so much for
this project to build a citation database, but rather for the fact that
without any binding of article text to the citation, the connection between
them is likely to degrade as successive contributors come along and modify
the article, particularly so if they cannot access the source. I think we
have let ourselves be seduced into thinking that so long as we can **see**
a lot of inline citations [1][2][3] in our article, it is well-sourced, but
if we really can’t explain what text is supported by which source, is it
really well-sourced? You might as well just add a bibliography to the end
and forget in-line citations.

Now one might argue this is just as true with a traditional journal
article (again, no explicit binding of text to source), but the difference
is that a traditional journal article has a single author or a group of
tightly-coupled authors writing the article over a relatively short period
of time (weeks rather than years), who are likely to have shared access to
every source being cited and are able to confer among themselves if needed
to sort out any issue relating to citations, so we can expect the citations
to remain close to the text being supported by the citation. In Wikipedia,
we have a disconnected set of authors operating over different time frames
over an article lifetime of many years, who are unable to share their
source materials, and so I think the coupling between text and citation is
inevitably likely to be lost, because we leave no trace of the coupling for
the next contributor to uphold, even when everyone is acting in good faith.
Let’s call it “cite rot”, which I’ll define as a loss of verifiability due
to a disconnect between article text and source.
It seems to me that we need to make the connection between text and source
more explicit. Think of it from a reader perspective: in most e-readers you
can select a word or phrase and a dictionary lookup is performed to tell
you the meaning of the word(s). How about if, in the Wikipedia of 2030
(since we are discussing movement strategy at the moment), the reader could
select some words and the sources that support them are returned. E.g.
currently we might write

Joe Smith was born in London in 1830.[1][2]

Where [1] supports that he was born in London and [2] that he was born in
1830.

In my 2030 Wikipedia, if we clicked on London, cite [1] would highlight
(or something), and if we clicked on 1830, [2] would highlight, and if we
clicked on born, both would highlight. That is, the words “Joe Smith was
born in London” would be tagged as being [1] and “Joe Smith was born … in
1830” would be tagged as being [2]. And probably a little pop-up with the
exact quote out of the source document might appear for your verification
pleasure.
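One way such a binding could be represented - purely a sketch with hypothetical structures, not an existing MediaWiki feature - is a set of character spans, each linked to the citation that supports it:

```python
# Hypothetical span-to-citation mapping for one sentence.
text = "Joe Smith was born in London in 1830."

# Character spans (start, end) mapped to the citation supporting them.
# Spans may overlap: "born" falls inside both claims.
bindings = [
    ((0, 28), 1),   # "Joe Smith was born in London" -> cite [1]
    ((0, 18), 2),   # "Joe Smith was born" ...
    ((29, 36), 2),  # ... "in 1830" -> cite [2]
]

def cites_for(position):
    """Which citations support the character at this position?"""
    return sorted({cite for (start, end), cite in bindings
                   if start <= position < end})

print(cites_for(text.index("London")))  # [1]
print(cites_for(text.index("1830")))    # [2]
print(cites_for(text.index("born")))    # [1, 2]
```

Clicking a word would then amount to looking up its character position and highlighting every citation whose span covers it.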
Now of course we have enough problems with getting our contributors to
supply any sources, let alone binding them to chunks of text as my proposal
would entail. But I hear the Movement Strategy conversation is talking
about improved quality and improved verifiability, so maybe it’s part of
the quality assessment: if you want a VGA (verifiable good article), the
text-to-cite mapping must be embedded in the article and almost all of the
text must be “covered” (in the mathematical sense) by the mapping. Indeed,
the extent of coverage could be a verifiability metric.
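Such a metric would be cheap to compute - a sketch, assuming cited text is represented as character spans (a hypothetical representation, not an existing metric):

```python
def coverage(text_length, spans):
    """Fraction of characters covered by at least one citation span."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return len(covered) / text_length

# Hypothetical article of 100 characters with two overlapping cited spans.
print(coverage(100, [(0, 40), (30, 60)]))  # 0.6
```

A VGA threshold could then simply require the score to exceed some agreed value.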
OK, maybe what I am proposing is not the way to go, but I think we ought
to be thinking about this issue of cite rot, because I think it’s a real
problem. I suspect it’s already out there, but we don’t notice it because
we **see** lots of inline citations and assume all is well.
Kerry
*From:* Andrea Forte [mailto:andrea.forte@gmail.com]
*Sent:* Wednesday, 3 May 2017 11:46 PM
*To:* kerry.raymond(a)gmail.com
*Cc:* Research into Wikimedia content and communities <
wiki-research-l(a)lists.wikimedia.org>
*Subject:* Re: [Wiki-research-l] Citation Project - Comments Welcome!
...and YES, detecting when a reference has changed but the adjacent text
has not is something that will be detectable with the dataset we aim to
produce. That's a great idea!
On Tue, May 2, 2017 at 7:59 AM, Kerry Raymond <kerry.raymond(a)gmail.com>
wrote:
Just a couple of thoughts that cross my mind ...
If people use the {{cite book}} etc. templates, it will be relatively easy
to work out what the components of the citation are. However, if people
roll their own, e.g.

<ref>[http://someurl This And That], Blah Blah 2000</ref>

you may have some difficulty working out what is what. I've just been
through a tedious exercise of updating a set of URLs using AWB over some
thousands of articles, and some of the ways people roll their own citations
were quite remarkable (and often quite unhelpful). It may be that you can't
extract much from such citations. However, the good news is that if they
have a URL in them, it will probably be in plain sight.
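For such hand-rolled citations, a first pass might simply pull anything URL-shaped out of each <ref> body - a rough sketch, not the project's actual extraction code:

```python
import re

# Match <ref>...</ref> bodies, then any http(s) URL inside them.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
URL_RE = re.compile(r"https?://[^\s\]|<]+")

def ref_urls(wikitext):
    """Return URLs found inside hand-rolled <ref> tags."""
    return [url
            for body in REF_RE.findall(wikitext)
            for url in URL_RE.findall(body)]

sample = 'Born in 1830.<ref>[http://someurl This And That], Blah Blah 2000</ref>'
print(ref_urls(sample))  # ['http://someurl']
```

This deliberately ignores named and self-closing <ref> tags and template-generated URLs, which is exactly the gap the templates discussed below fall into.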
Whereas there are a number of templates that I regularly use for citations,
like {{cite QHR}} (currently 1234 transclusions), {{cite QPN}} (currently
2738 transclusions) and {{Census 2011 AUS}} (4400 transclusions), all of
which generate their URLs. I'm not sure how you will deal with these in
terms of extracting URLs.
But whatever the limitations, it will be a useful dataset to answer some
interesting questions.
One phenomenon I often see is new users updating information (e.g.
changing the population of a town) while leaving behind the old citation
for the previous value. So it superficially looks like the new information
is cited to a reliable source when in fact it isn't. I've often wished we
could automatically detect and raise a "warning" when the "text being
supported" by the citation changes yet the citation does not. The problem,
of course, is that we only know where the citation appears in the text, and
we presume it is in support of "some earlier" text (without being clear
exactly where it is). And if an article is reorganised, it may well result
in the citation "drifting away" from the text it supports, or even being in
support of text that has been deleted. So I think it is important to know
what text preceded the citation at the time the citation first appears in
the article history, as it may be useful to compare it against the text
that *now* appears before it. It is a great pity that (in these digital
times) we have not developed a citation model where you select chunks of
text and link your citation to them, so that the relationship between the
text and the citation is more apparent.
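A crude version of that comparison - a sketch with made-up revision data, not a deployed tool - could diff the text that preceded a citation when it was introduced against the text that precedes it now:

```python
from difflib import SequenceMatcher

def preceding_text(wikitext, ref_markup, window=80):
    """Text immediately before the citation's first occurrence."""
    pos = wikitext.find(ref_markup)
    return "" if pos < 0 else wikitext[max(0, pos - window):pos]

def drift(old_revision, new_revision, ref_markup):
    """Similarity (0-1) between a citation's old and new context.
    A low score suggests the supported text changed while the
    citation stayed the same."""
    old = preceding_text(old_revision, ref_markup)
    new = preceding_text(new_revision, ref_markup)
    return SequenceMatcher(None, old, new).ratio()

# Hypothetical revisions: the population was edited, the citation was not.
ref = "<ref>2011 census</ref>"
rev1 = "The town has a population of 1,234." + ref
rev2 = "The town has a population of 9,876." + ref
score = drift(rev1, rev2, ref)
print(round(score, 2))
```

Any score below 1.0 means the text in front of the unchanged citation was edited; a near-zero score would suggest the citation has drifted away from, or outlived, the text it once supported.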
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-
bounces(a)lists.wikimedia.org]
On Behalf Of Andrea Forte
Sent: Tuesday, 2 May 2017 5:18 AM
To: Research into Wikimedia content and communities <
Wiki-research-l(a)lists.wikimedia.org>
Subject: [Wiki-research-l] Citation Project - Comments Welcome!
Hi all,
One of my PhD students, Meen Chul Kim, is a data scientist with experience
in bibliometrics, and we will be working on some citation-related research
together with Aaron and Dario in the coming months. Our main goal in the
short term is to develop an enhanced citation dataset that will allow for
future analyses of citation data associated with article quality,
lifecycle, editing trends, etc.
The project page is here:
https://meta.wikimedia.org/wiki/Research:Understanding_the_context_of_citations_in_Wikipedia
The project is just getting started, so this is a great time to offer
feedback and suggestions, especially for features of citations that we
should mine as a first step, since this will affect what the dataset can be
used for in the future.
Looking forward to seeing some of you at WikiCite!!
Andrea
--
:: Andrea Forte
:: Associate Professor
:: College of Computing and Informatics, Drexel University
::
http://www.andreaforte.net
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l