It's worth commenting that link rot occurs in a variety of ways.
The obvious way is that the URL is broken and a 404 error is returned to the browser.
Or, rather than sending a 404 to the browser, the site redirects you to a page that says
"Page not found" without an error 404.
Or you are redirected to a search page which does not find what you want.
Or you are redirected to a general search page from which you may or may not find the
page you were after at a new URL.
Or the URL has been replaced by a specialised search which will give you what you want but
not in a way that you can use for citing or archiving. A lot of sites seem to be
increasingly hiding content by returning it as search results that you cannot archive.
Or the URL works but the content on the page is not what was expected (different topic),
which occurs with sites that number (and then re-number) their web pages or when
cybersquatters buy an expired domain name.
Or the URL works, continues to be about the topic expected, but does not say anything to
back up the claim in the Wikipedia article because the content has changed since.
Or the URL works and the content NEVER said what the Wikipedia article claims (contributor
error or deliberate misleading).
And there may be more variations on the theme that I have forgotten about.
Obviously these variations have to be detected in different ways. And for archive sites,
it is often impossible to recognise in an automated way that a lot of these have occurred.
It can be really tedious to wade through dozens of archived snapshots of a webpage finding
"Page not found" pages in your search for the "most recent
really-what-I-wanted content". This is a problem for the Internet Archive Bot.
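To make the point about detection concrete, here is a minimal Python sketch of how a checker might sort one fetched citation URL into some of the categories above. It is only an illustration under stated assumptions: the function name classify_link, the category labels and the phrase list are all hypothetical, and it takes the response details as plain arguments rather than fetching anything itself.

```python
from urllib.parse import urlparse

# Hypothetical phrases hinting at a "soft 404"; a real list would need tuning per site.
SOFT_404_PHRASES = ("page not found", "no longer available")


def classify_link(requested_url, final_url, status_code, body_text):
    """Return a rough link-rot category for one fetched citation URL.

    requested_url: the URL as written in the citation
    final_url:     the URL actually reached after any redirects
    status_code:   the final HTTP status code
    body_text:     the text of the page that came back
    """
    if status_code in (404, 410):
        return "hard-404"
    if status_code >= 400:
        return "other-error"

    # A page that loads normally but says "Page not found".
    text = body_text.lower()
    if any(phrase in text for phrase in SOFT_404_PHRASES):
        return "soft-404"

    requested = urlparse(requested_url)
    final = urlparse(final_url)
    # Crude redirect check: a host change such as adding "www." also counts.
    if final.netloc != requested.netloc or final.path != requested.path:
        if "search" in final.path.lower():
            return "redirected-to-search"
        return "redirected-elsewhere"

    # The URL "works". Whether it is the right topic, or still supports the
    # claim, or ever did, cannot be decided from the HTTP response alone.
    return "needs-human-check"
```

For example, classify_link("http://example.org/a", "http://example.org/search?q=a", 200, "Results") gives "redirected-to-search". Note that the last three variations in my list (wrong topic, changed content, never said it) all fall through to "needs-human-check", which is exactly the part that is hard to automate.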
So you often need a human to say "hey, it's broken" at which point the
Internet Archive Bot may try to fix it. Because the bot writers know that the bot can be
fooled by finding an "archived page" that actually doesn't replace the
deadlink with useful content, they put those very long messages on the Talk pages to try
to ask people to check the rescued citation. I don't know about other people, but when
the Internet Archive Bot was released, it deluged my watchlist and I simply stopped
checking its work (I could never have kept up). Now its volume has reduced but I'm now
trained to ignore it. I think it does a better job at archiving external links than
rescuing (but given the variations above, this is not to be wondered at).
At the end of the day, most deadlinks need a human in the loop for recovery. And it's
a huge task and a tedious one. But I do dabble in it from time to time for claims that
seem particularly "bold" or on articles that I care a little bit more about. So
let me talk about the process.
One of the problems is that for URLs that I did not add myself, I can see the deadlink
citation and I may have located what I think is a replacement page (whether on the
original website or from an archive or whatever), say with a similar-ish title appearing
to talk about the topic of the Wikipedia article, but my problem is that I cannot tell
from the article how much of the content preceding the citation (or in the case of bullet
lists, tables, etc, following the citation) is intended to be supported by the citation.
So I don't really know if some particular claim is supposed to be supported by the
nearest citation or whether it may be supported by another citation that has drifted a
long way away. I've emailed at some length previously about this problem of being
unable to relate chunks of text in articles to citations and the citation rot that occurs
as the article grows and the citations drift into the wrong text (or just get deleted
because a subsequent editor can't see where they fit into the narrative or can't
be bothered to see). So, not quite knowing what information was supposed to be supported
by this citation, it is genuinely hard to say if the new URL I have found is or isn't
an adequate replacement. Am I doing more harm by replacing it when I may not be totally
confident, or should I leave it for someone else to decide (assuming someone else will
even try)? I often try to fix a deadlink citation but back away because I just don't
know if I am doing the right thing or not.
To try to get around the "citation rot" issues, if I am highly motivated that
day, I use WikiBlame to try to locate the version of the article in the History where the
citation was added. This gives me the best chance to know what information it was intended
to support. So then I go and look in Internet Archive and find the URL has been archived,
but the first archived version is AFTER the date of the version of the Wikipedia article
that added the citation. Is this a problem? Generally I take the risk and go for it if the
info seems to be consistent. At the end of the day, an archive has a series of snapshots
in time of a webpage, and it is difficult to know if the webpage as viewed by the people
adding the citation corresponds to any of these. Obviously the snapshots immediately
before and immediately after the URL access date are the best ones, but still may not
reflect the contents of a highly dynamic website as at the URL access date (newspaper home
pages are classic problems of this as the headline articles can change in minutes, far
faster than any archive can track).
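The "snapshots immediately before and immediately after" idea can be sketched in a few lines of Python. This is a rough sketch, not how any existing bot does it: the function name snapshots_around is hypothetical, and I am assuming you have already obtained the list of snapshot timestamps for the URL in the 14-digit form the Wayback Machine uses in its snapshot URLs.

```python
from bisect import bisect_right
from datetime import datetime

# 14-digit timestamp form (YYYYMMDDhhmmss) seen in Wayback Machine snapshot URLs.
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def snapshots_around(timestamps, access_date):
    """Return (latest snapshot at or before access_date, earliest snapshot after it).

    Either element is None when no snapshot exists on that side, e.g. when the
    first archived copy is later than the date the citation was added.
    """
    times = sorted(datetime.strptime(ts, WAYBACK_FORMAT) for ts in timestamps)
    i = bisect_right(times, access_date)  # number of snapshots <= access_date
    before = times[i - 1] if i > 0 else None
    after = times[i] if i < len(times) else None
    return before, after
```

If the "before" value comes back as None, you are in the situation I described, where the page was first archived only after the citation was added, and you have to decide whether to take the risk. Even when both values exist, they only bracket the access date; they do not prove the page looked the same on the day itself.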
Damn it, it just becomes too hard and, after a run of being unable to fix a deadlink
citation, I give up and do something more enjoyable.
The one exception where I do have greater success is when I am trying to rescue a citation
URL that I added myself. Although I may have reached the age when I don't remember
that I even created an article or its citation, nonetheless when I see them, some faint
memory is jogged and the synapses connect, and I generally do manage to decide if the
rescue URL suffices.
I am only discussing external links in citations here, but obviously similar comments
apply to external links in infoboxes or in the External Links section. Except that in the
External Links section, you often get very little context for the intended purpose of the
link. But then I figure that not much harm is done to delete such deadlinks if I cannot
find any plausible rescue URL after a reasonably diligent search.
The loss of citation URLs matters more for article verification. Having said that, not all
citation URLs are equally important. Obviously the ones relating to the notability of
the subject are very important, as are those that support information which, if incorrect,
could cause significant harm to the reader (e.g. medical advice) or create a risk of libel
(e.g. biographies of living people). Some deadlink citation URLs support information that
seems plausible and isn't likely to cause harm to the reader if it's wrong and so
the loss of these citations is annoying but isn't a catastrophe, e.g. "Kenmore
was first settled in 1950 [cite] and its rugby team, the Kenmore Bulldogs Club, was formed
in 1955, competing in its first A-grade competition in 1957 [dead cite]".
So, summing up all of the above, it's a big problem, it's a hard problem, it's
a worthy problem, but if you are going to tackle it:
* be realistic about what can be achieved by setting small goals
* be realistic about how little human volunteer effort is likely to be willing and able to
assist, and make sure you use what you can get in the most productive and fulfilling
way, to maintain their engagement (I think Internet Archive Bot burned volunteer
engagement by wanting too much too quickly on a task that was often too hard to even
understand, let alone carry out)
* focus the efforts on where the need is greatest (medical, BLP, or citations in the lede
of articles likely to relate to notability)
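As a rough illustration of the first and third points, a triage step might look something like the Python sketch below. Everything in it is an assumption for the sake of example: the DeadCitation fields, the weights in priority and the function names are hypothetical placeholders, not anything an existing tool uses.

```python
from dataclasses import dataclass


@dataclass
class DeadCitation:
    article: str
    is_medical: bool = False  # supports medical information
    is_blp: bool = False      # article is a biography of a living person
    in_lede: bool = False     # citation sits in the lede, likely tied to notability


def priority(citation):
    """Higher score means look at it sooner. The weights are arbitrary placeholders."""
    return 3 * citation.is_medical + 3 * citation.is_blp + 1 * citation.in_lede


def triage(citations, limit):
    """Return only the `limit` highest-priority dead citations (a small goal)."""
    return sorted(citations, key=priority, reverse=True)[:limit]
```

The limit argument is there on purpose: handing a volunteer five high-need citations is more likely to get done than handing them five hundred.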
My 10ccs!
Kerry
-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of
Leila Zia
Sent: Monday, 26 June 2017 11:48 PM
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] link rot
Hi James,
On Mon, Jun 26, 2017 at 8:04 AM, James Salsman <jsalsman(a)gmail.com> wrote:
Is anyone studying the rate at which external links become unavailable
on Wikipedia projects?
I just did a quick tally and less than 40% of the external links cited
in the introductions of L1-vital enwiki health and social science
articles I sampled were good, and that's only counting those which
didn't already have a {{dead link}} tag.
I thought that the bots were doing a better job of replacing dead
links with archive copies than they apparently are.
Two items to share:
* In FY17-18 Annual Plan, Program 11 [1]: Objective 1, Outcome 1 is closely related to
your question/observation. I expect more research in this space as a result.
* InternetArchiveBot [2] is one bot that I know operates in this space. If you are
interested in it, it would be good to have a discussion with the team behind that bot to
learn how the bot currently operates and what it needs to be improved.
Best,
Leila
[1] https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/…
[2] https://en.wikipedia.org/wiki/User:InternetArchiveBot
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l