Is anyone studying the rate at which external links become unavailable on Wikipedia projects?
I just did a quick tally and less than 40% of the external links cited in the introductions of L1-vital enwiki health and social science articles I sampled were good, and that's only counting those which didn't already have a {{dead link}} tag.
I thought that the bots were doing a better job of replacing dead links with archive copies than they apparently are. Do we need to fund this as an official effort?
Hi James,
On Mon, Jun 26, 2017 at 8:04 AM, James Salsman jsalsman@gmail.com wrote:
Is anyone studying the rate at which external links become unavailable on Wikipedia projects?
I just did a quick tally and less than 40% of the external links cited in the introductions of L1-vital enwiki health and social science articles I sampled were good, and that's only counting those which didn't already have a {{dead link}} tag.
I thought that the bots were doing a better job of replacing dead links with archive copies than they apparently are.
Two items to share:
* In FY17-18 Annual Plan, Program 11 [1]: Objective 1, Outcome 1 is closely related to your question/observation. I expect more research in this space as a result.
* InternetArchiveBot [2] is one bot that I know operates in this space. If you are interested in it, it would be good to have a discussion with the team behind that bot to learn how the bot currently operates and what it needs to be improved.
Best, Leila
[1] https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/D...
[2] https://en.wikipedia.org/wiki/User:InternetArchiveBot
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
It's worth commenting that link rot occurs in a variety of ways:
* The obvious way is that the URL is broken and an error 404 is returned to the browser.
* Rather than sending a 404 to the browser, the site redirects you to a page that says "Page not found" without an error 404.
* You are redirected to a general search page, from which you may or may not find the page you were after at a new URL.
* The URL has been replaced by a specialised search which will give you what you want, but not in a way that you can use for citing or archiving. (A lot of sites increasingly seem to be hiding content by returning it as search results that you cannot archive.)
* The URL works but the content on the page is not what was expected (a different topic), which occurs with sites that number (and then renumber) their web pages, or when cybersquatters buy an expired domain name.
* The URL works and continues to be about the topic expected, but does not say anything to back up the claim in the Wikipedia article because the content has changed since.
* The URL works and the content NEVER said what the Wikipedia article claims (contributor error or deliberate misleading).
And there may be more variations on the theme that I have forgotten about.
Obviously these variations have to be detected in different ways. And for archive sites, it is often impossible to recognise in an automated way that a lot of these have occurred. It can be really tedious to wade through dozens of archived snapshots of a webpage finding "Page not found" pages in your search for the "most recent really-what-I-wanted content". This is a problem for the Internet Archive Bot.
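As a rough illustration of why these cases need different detection, here is a minimal Python sketch that labels one already-fetched URL. Everything in it is an assumption made for illustration: the `Fetch` record, the label names and the phrase list are hypothetical, and the heuristics are crude (for example, a move from `example.org` to `www.example.org` would count as off-site). Note that the last two cases in the list above (content that changed, or never supported the claim) cannot be detected this way at all, so anything that looks alive is only marked for human checking.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical phrases suggesting an error page served with a success status.
SOFT_404_PHRASES = ("page not found", "no longer available")


@dataclass
class Fetch:
    requested_url: str  # URL as it appears in the citation
    final_url: str      # URL after following any redirects
    status: int         # final HTTP status code
    body: str           # text of the page that was returned


def classify(fetch: Fetch) -> str:
    """Return a rough link-rot label for one fetched citation URL."""
    if fetch.status in (404, 410):
        return "hard-404"
    if fetch.status >= 400:
        return "http-error"
    text = fetch.body.lower()
    if any(phrase in text for phrase in SOFT_404_PHRASES):
        return "soft-404"
    req, fin = urlparse(fetch.requested_url), urlparse(fetch.final_url)
    if req.netloc != fin.netloc:
        return "redirected-offsite"
    if fin.path in ("", "/") and req.path not in ("", "/"):
        return "redirected-to-homepage"
    if "search" in fin.path.lower() and "search" not in req.path.lower():
        return "redirected-to-search"
    # The page loads, but whether it still supports the claim is unknown.
    return "needs-human-check"
```

A label of "needs-human-check" is deliberately the default: a working URL says nothing about whether the page still backs up the claim it is cited for.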
So you often need a human to say "hey, it's broken" at which point the Internet Archive Bot may try to fix it. Because the bot writers know that the bot can be fooled by finding an "archived page" that actually doesn't replace the deadlink with useful content, they put those very long messages on the Talk pages to try to ask people to check the rescued citation. I don't know about other people, but when the Internet Archive Bot was released, it deluged my watchlist and I simply stopped checking its work (I could never have kept up). Now its volume has reduced but I'm now trained to ignore it. I think it does a better job at archiving external links than rescuing (but given the variations above, this is not to be wondered at).
At the end of the day, most deadlinks need a human in the loop for recovery. And it's a huge task and a tedious one. But I do dabble in it from time to time for claims that seem particularly "bold" or on articles that I care a little bit more about. So let me talk about the process.
One of the problems is that for URLs that I did not add myself, I can see the deadlink citation and I may have located what I think is a replacement page (whether on the original website or from an archive or whatever), say with a similar-ish title appearing to talk about the topic of the Wikipedia article, but my problem is that I cannot tell from the article how much of the content preceding the citation (or in the case of bullet lists, tables, etc, following the citation) is intended to be supported by the citation. So I don't really know if some particular claim is supposed to be supported by the nearest citation or whether it may be supported by another citation that has drifted a long way away. I've emailed at some length previously about this problem of being unable to relate chunks of text in articles to citations and the citation rot that occurs as the article grows and the citations drift into the wrong text (or just get deleted because a subsequent editor can't see where they fit into the narrative or can't be bothered to see). So, not quite knowing what information was supposed to be supported by this citation, it is genuinely hard to say if the new URL I have found is or isn't an adequate replacement. Am I doing more harm by replacing it when I may not be totally confident, or should I leave it for someone else to decide (assuming someone else will even try)? I often try to fix a deadlink citation but back away because I just don't know if I am doing the right thing or not.
To try to get around the "citation rot" issues, if I am highly motivated that day, I use WikiBlame to try to locate the version of the article in the History where the citation was added. This gives me the best chance to know what information it was intended to support. So then I go and look in Internet Archive and find the URL has been archived, but the first archived version is AFTER the date of the version of the Wikipedia article that added the citation. Is this a problem? Generally I take the risk and go for it if the info seems to be consistent. At the end of the day, an archive has a series of snapshots in time of a webpage, and it is difficult to know if the webpage as viewed by the people adding the citation corresponds to any of these. Obviously the snapshots immediately before and immediately after the URL access date are the best ones, but still may not reflect the contents of a highly dynamic website as at the URL access date (newspaper home pages are classic problems of this as the headline articles can change in minutes, far faster than any archive can track).
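The "snapshots immediately before and immediately after" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the list of snapshot timestamps is assumed to have been obtained already, and the timestamps are assumed to be 14-digit `YYYYMMDDhhmmss` strings as used in Wayback Machine snapshot URLs. It does nothing about the harder question of whether either snapshot shows what the citing editor actually saw.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss snapshot timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def bracket(snapshots: List[str],
            access_date: datetime) -> Tuple[Optional[str], Optional[str]]:
    """Return (last snapshot at or before access_date, first snapshot after it).

    Either element is None when no snapshot exists on that side.
    """
    ordered = sorted(snapshots)  # fixed-width digit strings sort chronologically
    times = [parse_ts(ts) for ts in ordered]
    i = bisect_right(times, access_date)  # count of snapshots <= access_date
    before = ordered[i - 1] if i > 0 else None
    after = ordered[i] if i < len(ordered) else None
    return before, after
```

When `before` comes back as `None`, that is exactly the situation described above: the first archived version is later than the edit that added the citation, so a human has to judge whether the later snapshot is consistent with the claim.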
Damn it, it just becomes too hard and, after a run of being unable to fix a deadlink citation, I give up and do something more enjoyable.
The one exception where I do have greater success is when I am trying to rescue a citation URL that I added myself. Although I may have reached the age when I don't remember that I even created an article or its citation, nonetheless when I see them, some faint memory is jogged and the synapses connect, and I generally do manage to decide if the rescue URL suffices.
I am only discussing external links in citations here, but obviously similar comments apply to external links in infoboxes or in the External Links section. Except that in the External Links section, you often get very little context for the intended purpose of the link. But then I figure that not much harm is done to delete such deadlinks if I cannot find any plausible rescue URL after a reasonably diligent search.
The loss of citation URLs matters more for article verification. Having said that, not all citation URLs are equally important. Obviously the ones relating to the notability of the subject are very important, as are those that support information which, if incorrect, could cause significant harm to the reader (e.g. medical advice) or create a risk of libel (e.g. biographies of living people). Some deadlink citation URLs support information that seems plausible and isn't likely to cause harm to the reader if it's wrong and so the loss of these citations is annoying but isn't a catastrophe, e.g. "Kenmore was first settled in 1950 [cite] and its rugby team, the Kenmore Bulldogs Club, was formed in 1955, competing in its first A-grade competition in 1957 [dead cite]".
So, summing up all of the above, it's a big problem, it's a hard problem, it's a worthy problem, but if you are going to tackle it
* be realistic about what can be achieved by setting small goals
* be realistic about how little human volunteer effort is likely to be willing and able to assist, and make sure you use what you can get in the most productive and fulfilling way, to maintain their engagement (I think Internet Archive Bot burned volunteer engagement by wanting too much too quickly on a task that was often too hard to even understand, let alone carry out)
* focus the efforts on where the need is greatest (medical, BLP, or citations in the lede of articles likely to relate to notability)
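To make the last point concrete, here is a small Python sketch of how a queue of dead citations might be ordered so that scarce volunteer time goes to the riskiest ones first. The `DeadCitation` record, its flags and the weights are all hypothetical choices made for illustration; how an article would actually be flagged as medical or BLP is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeadCitation:
    url: str
    medical: bool = False   # supports medical information
    blp: bool = False       # appears in a biography of a living person
    in_lede: bool = False   # cited in the lede, so may relate to notability


def priority(c: DeadCitation) -> int:
    """Higher score means fix sooner; the weights are arbitrary assumptions."""
    return 3 * c.medical + 3 * c.blp + 2 * c.in_lede


def triage(citations: List[DeadCitation]) -> List[DeadCitation]:
    """Return the citations ordered from most to least urgent."""
    return sorted(citations, key=priority, reverse=True)
```

Keeping the list short, say the top handful per volunteer, would fit the "small goals" advice better than handing anyone the whole backlog.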
My 10ccs!
Kerry
-----Original Message----- From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Leila Zia Sent: Monday, 26 June 2017 11:48 PM To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] link rot
James Salsman jsalsman@gmail.com writes:
Is anyone studying the rate at which external links become unavailable on Wikipedia projects?
There've been a few studies over the years, but none of the ones I know of are recent. One from 2011 that may nonetheless be interesting is:
P. Tzekou, S. Stamou, N. Kirtsis, N. Zotos. Quality assessment of Wikipedia external links. In Proceedings of Web Information Systems and Technologies (WEBIST) 2011. http://www.dblab.upatras.gr/download/nlp/NLP-Group-Pubs/11-WEBIST_Wikipedia_...
-Mark
On 06/26/2017 04:43 PM, Mark J. Nelson wrote:
James Salsman jsalsman@gmail.com writes:
Is anyone studying the rate at which external links become unavailable on Wikipedia projects?
There've been a few studies over the years, but none of the ones I know of are recent. One from 2011 that may nonetheless be interesting is:
P. Tzekou, S. Stamou, N. Kirtsis, N. Zotos. Quality assessment of Wikipedia external links. In Proceedings of Web Information Systems and Technologies (WEBIST) 2011. http://www.dblab.upatras.gr/download/nlp/NLP-Group-Pubs/11-WEBIST_Wikipedia_...
-Mark
There is a Japanese study from the same year:
Characteristics of external links and dead links in Japanese Wikipedia https://dx.doi.org/10.2964/JSIK.21_06
This was found on the Scholia page for the link rot topic:
https://tools.wmflabs.org/scholia/topic/Q1193907
/Finn