Dear all

 

I don’t think this is a difficult problem; two points should be clarified first:

1. Almost all facts on Wikipedia need not be sourced.

2. Sourcing cannot inform us of the truth or falsehood of a fact – at best, it indicates the authority of its source.

 

I am not a lawyer, only an “Information Specialist,” so the following should be double-checked with legal:

 

Fact: Google caches practically anything it indexes.

However, attempts by some site owners to claim that this cache violates their copyright have been consistently dismissed by US courts.

I am not even sure what the legal grounds were, but caching is considered part of web technology; browser caches, for example, are also not considered copyright violations.

 

On the drawing board of the Next Generation Search Engine is a “content analytics” capability for assessing the authority of references using:

1. A transitive bibliometric authority model (a rough sketch follows this list).

2. A metric of reference longevity (a fad vs. fact test).

3. Bootstrapping via content analysis where the full text of the source is available.
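
To make point 1 a bit more concrete, here is a minimal Python sketch of the kind of transitive authority computation I have in mind – essentially a PageRank-style iteration over a “source cites source” graph. The example graph, the field names and the damping factor are illustrative assumptions on my part, not an existing design.

# Minimal sketch of a transitive bibliometric authority model:
# a PageRank-style iteration over a "source cites source" graph.
# The toy graph and the damping factor below are illustrative assumptions.

def authority_scores(citations, damping=0.85, iterations=30):
    """citations: dict mapping a source to the list of sources it cites."""
    sources = set(citations) | {s for cited in citations.values() for s in cited}
    score = {s: 1.0 / len(sources) for s in sources}

    for _ in range(iterations):
        new_score = {s: (1.0 - damping) / len(sources) for s in sources}
        for src, cited in citations.items():
            if not cited:
                continue  # sketch only: mass from dangling sources is not redistributed
            share = damping * score[src] / len(cited)
            for target in cited:
                new_score[target] += share  # authority flows transitively to cited sources
        score = new_score
    return score

if __name__ == "__main__":
    # Toy graph: nature.com is cited by both other sources, so it should rank highest.
    graph = {
        "example-blog.org": ["nature.com", "example-news.com"],
        "example-news.com": ["nature.com"],
        "nature.com": [],
    }
    for source, s in sorted(authority_scores(graph).items(), key=lambda kv: -kv[1]):
        print(f"{source}: {s:.3f}")

In a real deployment the graph would of course come from the indexed references rather than a hand-written dictionary, and the longevity metric (point 2) could be folded in as a per-source weight.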

 

While the above model is complex, it would take me about two weeks of work to set up a prototype reference repository --

a Nutch (crawler) + Solr + storage (say MySQL/HBase/Cassandra) combination to index:

external links, 

references including URLs,

references with no URLs.

This data would be immediately consumable via HTTP using standards-based requests (a Solr feature) – see the sketch below.
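
Here is a rough Python sketch of how such reference documents could be pushed to and pulled from Solr over plain HTTP. The core name (“references”) and all field names are assumptions for illustration only, not an agreed schema.

# Rough sketch: adding a reference document to Solr and querying it over HTTP.
# The core name "references" and all field names are illustrative assumptions.
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/references"  # assumed local Solr core

def add_reference(doc):
    """Index one reference document via Solr's JSON update handler."""
    req = urllib.request.Request(
        SOLR + "/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req).read()

def find_references(query):
    """Run a standard Solr select query and return the parsed JSON response."""
    url = SOLR + "/select?" + urllib.parse.urlencode({"q": query, "wt": "json"})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    add_reference({
        "id": "enwiki:Example:ref-1",      # page + reference position (assumed key scheme)
        "type": "reference_with_url",      # or external_link / reference_no_url
        "url": "http://example.org/paper",
        "cited_title": "An example paper",
    })
    results = find_references('type:"reference_with_url"')
    print(results["response"]["numFound"])

Any HTTP client (including a MediaWiki extension or a gadget) could consume the same select endpoint directly, which is what makes the data immediately usable without extra plumbing.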

 

 

Integrating with the existing search UI would probably take another two weeks, as would adding support for caching/indexing the most significant non-HTML document formats.

 

However, it would not be able to access content behind paywalls without a password. If (and only if) WMF sanctions this strategy, I could also draft a ‘win-win’ policy for encouraging such “hidden web” resource owners to provide free access to such a crawler, and possibly even open up their paywalls to our editors.

e.g. removing the “nofollow” directive from links to high-WP:RS partners…

 

I hope this helps.

 

 

Oren Bochman.

 

MediaWiki Search Developer.

 

From: wikidata-l-bounces@lists.wikimedia.org [mailto:wikidata-l-bounces@lists.wikimedia.org] On Behalf Of John Erling Blad
Sent: Sunday, April 01, 2012 10:01 PM
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata-l] Archiving references for facts?

 

Archiving a page should be pretty safe as long as the archived copy is only for internal use – that means something like OTRS. If the archived copy is republished it _might_ be viewed as a copyright infringement.

Still, note that the archived copy can be used for automatic verification, i.e. extracting a quote and checking it against a stored value, without infringing any copyright. If a publication is withdrawn, it might be an indication that something is seriously wrong with the page, and no matter what the archived copy at WebCitation says, the page can't be trusted.

It's really a very difficult problem.

Jeblad

On 1. apr. 2012 14.08, "Helder" <helder.wiki@gmail.com> wrote: