Yes, we've parsed some citation data in the past and there are graduated
levels of interpretability... especially since we aim to look at citations
over time, early revisions are likely to have more variation than those in
more recent years when there have been more tools available to help people
format. Around 2005 I built a mediawiki extension (well, it turned out to
be a fork really) that structured the insertion of reference data in an
article and stored it in a separate reference table in the database. How I
wish I had figured out how to make that a scalable tool then, so we
wouldn't have this problem now!
One thing we've discussed is that although what we are really interested in
is the sources--what references point to--our ability to understand what
those sources are is limited by how well we can successfully parse and
extract the reference text itself.
On Tue, May 2, 2017 at 7:59 AM, Kerry Raymond <kerry.raymond(a)gmail.com>
wrote:
Just a couple of thoughts that cross my mind ...
If people use the {{cite book}} etc templates, it will be relatively easy
to work out what the components of the citation are. However if people roll
their own, e.g.
<ref>[http://someurl This And That], Blah Blah 2000</ref>
you may have some difficulty working out what is what. I've just been
through a tedious exercise of updating a set of URLs using AWB over some
thousands of articles and some of the ways people roll their own citations
were quite remarkable (and often quite unhelpful). It may be that you can't
extract much from such citations. However, the good news is that if they
have a URL in them, it will probably be in plain sight.
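To make the distinction concrete, here is a minimal sketch (not anyone's actual pipeline) of separating templated citations from hand-rolled ones and pulling out any plain-sight URLs. The regular expressions, the function name classify_refs and the sample wikitext are all assumptions made for illustration; real wikitext has many more edge cases (nested templates, named refs, self-closing refs) than this handles.

```python
import re

# Matches <ref ...>body</ref>; self-closing tags such as <ref name="x" /> are skipped.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# Matches the opening of a {{cite xxx ...}} template and captures "xxx".
CITE_TEMPLATE_RE = re.compile(r"\{\{\s*cite\s+(\w+)", re.IGNORECASE)
# Matches a bare URL, stopping at whitespace or common wikitext delimiters.
URL_RE = re.compile(r"https?://[^\s\]|}<]+")


def classify_refs(wikitext):
    """Return one dict per <ref> body: its kind, template name, and visible URLs."""
    results = []
    for body in REF_RE.findall(wikitext):
        template = CITE_TEMPLATE_RE.search(body)
        results.append({
            "kind": "template" if template else "hand-rolled",
            "template": template.group(1).lower() if template else None,
            "urls": URL_RE.findall(body),
        })
    return results


# Hypothetical sample text, modelled on the example above.
sample = (
    "Fact one.<ref>[http://someurl This And That], Blah Blah 2000</ref> "
    "Fact two.<ref>{{cite book |title=Example |url=http://example.org/book}}</ref>"
)

for ref in classify_refs(sample):
    print(ref)
```

Note that this only finds URLs that are literally present in the wikitext; it would see nothing for templates that generate their URLs.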
By contrast, there are a number of templates that I regularly use for
citation, like {{cite QHR}} (currently 1234 transclusions), {{cite QPN}}
(currently 2738 transclusions) and {{Census 2011 AUS}} (4400
transclusions), all of which generate their URLs. I'm not sure how you will
deal with these in terms of extracting URLs.
But whatever the limitations, it will be a useful dataset to answer some
interesting questions.
One phenomenon I often see is new users updating information (e.g. changing
the population of a town) while leaving behind the old citation for the
previous value. So it superficially looks like the new information is cited
to a reliable source when in fact it isn't. I've often wished we could
automatically detect and raise a "warning" when the "text being supported"
by the citation changes yet the citation does not. The problem, of course,
is that we only know where the citation appears in the text and that we
presume it is in support of "some earlier" text (without being clear
exactly where it is). And if an article is reorganised, it may well result
in the citation "drifting away" from the text it supports or even that it
is in support of text that has been deleted. So I think it is important to
know what text preceded the citation at the time the citation first appears
in the article history as it may be useful to compare it against the text
that *now* appears before it. It is a great pity that (in these digital
times) we have not developed a citation model where you select chunks of
text and link your citation to them, so that the relationship between the
text and the citation is more apparent.
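As a rough illustration of that comparison, the sketch below takes the wikitext of the revision where a citation first appears and the current revision, grabs the text immediately preceding the citation in each, and reports the word-level differences. Everything here is an assumption for illustration: the function names, the 200-character window, matching citations by identical ref body, and the sample revisions. It does not address citations that drift after an article is reorganised.

```python
import difflib
import re

# Matches <ref ...>body</ref>; self-closing refs are skipped.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def preceding_text(wikitext, ref_body, window=200):
    """Return up to `window` characters before the first ref with this body."""
    for match in REF_RE.finditer(wikitext):
        if match.group(1).strip() == ref_body.strip():
            start = max(0, match.start() - window)
            return wikitext[start:match.start()].strip()
    return None  # citation not present in this revision


def context_changes(first_rev, current_rev, ref_body, window=200):
    """List (old, new) word runs that differ in the text before the citation.

    Returns None if the citation is missing from either revision, and an
    empty list if the preceding text is unchanged.
    """
    before = preceding_text(first_rev, ref_body, window)
    after = preceding_text(current_rev, ref_body, window)
    if before is None or after is None:
        return None
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    return [
        (" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical revisions: the population changes, the citation does not.
first = "The town has a population of 1,200.<ref>Census 2006</ref>"
current = "The town has a population of 3,450.<ref>Census 2006</ref>"

changes = context_changes(first, current, "Census 2006")
if changes:
    print("Warning: cited text changed but citation did not:", changes)
```

A non-empty result is only a prompt for a human to look; plenty of legitimate copyedits would trigger it too.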
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org]
On Behalf Of Andrea Forte
Sent: Tuesday, 2 May 2017 5:18 AM
To: Research into Wikimedia content and communities <
Wiki-research-l(a)lists.wikimedia.org>
Subject: [Wiki-research-l] Citation Project - Comments Welcome!
Hi all,
One of my PhD students, Meen Chul Kim, is a data scientist with experience
in bibliometrics and we will be working on some citation-related research
together with Aaron and Dario in the coming months. Our main goal in the
short term is to develop an enhanced citation dataset that will allow for
future analyses of citation data associated with article quality,
lifecycle, editing trends, etc.
The project page is here:
https://meta.wikimedia.org/wiki/Research:Understanding_the_context_of_citations_in_Wikipedia
The project is just getting started so this is a great time to offer
feedback and suggestions, especially for features of citations that we
should mine as a first step, since this will affect what the dataset can be
used for in the future.
Looking forward to seeing some of you at WikiCite!!
Andrea
--
:: Andrea Forte
:: Associate Professor
:: College of Computing and Informatics, Drexel University
::
http://www.andreaforte.net
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l