[Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

List overview All Threads
Download

newer

older

[Foundation-l] Report to the Board...

Re: [Foundation-l] Office hours...

emijrp

24 Aug 2010 24 Aug '10

1:57 p.m.

Hi all; I want to make a proposal about external links preservation. Many times, when you check an external link or a link reference, the website is dead or offline. This websites are important, because they are the sources for the facts showed in the articles. Internet Archive searches for interesting websites to save in their hard disks, so, we can send them our external links sql tables (all projects and languages of course). They improve their database and we always have a copy of the sources text to check when needed. I think that this can be a cool partnership. Regards, emijrp

Show replies by date

David Gerard

24 Aug 24 Aug

2:08 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote:

...

I want to make a proposal about external links preservation. Many times, when you check an external link or a link reference, the website is dead or offline. This websites are important, because they are the sources for the facts showed in the articles. Internet Archive searches for interesting websites to save in their hard disks, so, we can send them our external links sql tables (all projects and languages of course). They improve their database and we always have a copy of the sources text to check when needed. I think that this can be a cool partnership.

+1 - d.

Pascal Martin

2:20 p.m.

Dear all, The french ( since two years ) and hungarian wikipedia are using our archive, and it will be implemented in other french project. You could see them on all the french article : http://fr.wikipedia.org/wiki/Maupassant#Notes_et_r.C3.A9f.C3.A9rences just click on archive. When a link is making we storage the content in real time after the link is making. If you need more information you could contact us on freenode on the chanel : #linterweb. Sincerely Pascal ----- Original Message ----- From: "David Gerard" <dgerard(a)gmail.com> To: "Wikimedia Foundation Mailing List" <foundation-l(a)lists.wikimedia.org> Sent: Tuesday, August 24, 2010 4:08 PM Subject: Re: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

...

On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote:

+1 - d. _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Ray Saintonge

25 Aug 25 Aug

12:45 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

David Gerard wrote:

...

On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote:

Are people who clean up dead links taking the time to check Internet Archive to se if the page in question is there? Ec

Samuel Klein

12:48 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Here's the Archive's on-demand service: http://archive-it.org That would be the most reliable way to set up the partnership emijrp proposes. And it's certainly a good idea. Figuring out how to make it work for almost all editors and make it spam-proof may be interesting. SJ On Tue, Aug 24, 2010 at 8:45 PM, Ray Saintonge <saintonge(a)telus.net> wrote:

...

David Gerard wrote:

On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote:

Are people who clean up dead links taking the time to check Internet Archive to se if the page in question is there? Ec _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- Samuel Klein identi.ca:sj w:user:sj

George Herbert

12:56 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj(a)gmail.com> wrote:

...

David Gerard wrote:

On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote:

-- Samuel Klein identi.ca:sj w:user:sj _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Samuel Klein

1:13 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

I've asked Gordon Mohr @ IA about how to work with archive-it. I will cc: this thread on any response. SJ On Tue, Aug 24, 2010 at 8:56 PM, George Herbert <george.herbert(a)gmail.com> wrote:

...

On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj(a)gmail.com> wrote:

David Gerard wrote:

On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: > I want to make a proposal about external links preservation. Many times, > when you check an external link or a link reference, the website is dead or > offline. This websites are important, because they are the sources for the > facts showed in the articles. Internet Archive searches for interesting > websites to save in their hard disks, so, we can send them our external > links sql tables (all projects and languages of course). They improve their > database and we always have a copy of the sources text to check when needed. > I think that this can be a cool partnership. > +1

I actually proposed some form of Wikimedia / IArchive link collaboration some years ago to a friend who worked there at the time; however, they left shortly afterwards. I like SJ's particular idea. Who has current contacts with Brewster Kahle or someone else over there? -- -george william herbert george.herbert(a)gmail.com _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- Samuel Klein identi.ca:sj w:user:sj

emijrp

3:16 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Thanks SJ. 2010/8/25 Samuel Klein <meta.sj(a)gmail.com>

...

I've asked Gordon Mohr @ IA about how to work with archive-it. I will cc: this thread on any response. SJ On Tue, Aug 24, 2010 at 8:56 PM, George Herbert <george.herbert(a)gmail.com> wrote:

On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj(a)gmail.com> wrote: > Here's the Archive's on-demand service: > > http://archive-it.org > > That would be the most reliable way to set up the partnership emijrp > proposes. And it's certainly a good idea. Figuring out how to make > it work for almost all editors and make it spam-proof may be > interesting. > > SJ > > > > On Tue, Aug 24, 2010 at 8:45 PM, Ray Saintonge <saintonge(a)telus.net>

wrote:

>> David Gerard wrote: >>> On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: >>> >>>> I want to make a proposal about external links preservation. Many

times,

>>>> when you check an external link or a link reference, the website is

dead or

>>>> offline. This websites are important, because they are the sources

for the

>>>> facts showed in the articles. Internet Archive searches for

interesting

>>>> websites to save in their hard disks, so, we can send them our

external

>>>> links sql tables (all projects and languages of course). They improve

their

>>>> database and we always have a copy of the sources text to check when

needed.

>> I think that this can be a cool partnership. >> > +1 > > Are people who clean up dead links taking the time to check Internet Archive to se if the page in question is there? Ec _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Samuel Klein

28 Aug 28 Aug

7:48 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Gordon @ IA was most friendly and helpful. archive-it is a subscription service for focused collections of sites; he had a different idea better suited to our work. Gordon writes:

...

Now, given the importance of Wikipedia and editorial significant of things it outlinks-to, perhaps we could set up something specially focused on its content (and the de facto stream of newly-occurring outlinks), that would require no conscious effort by editors but greatly increase the odds that anything linked from Wikipedia would (a few months down the line) also be in our Archive. Is there (or could there be) a feed of all outlinks that IA could crawl almost nonstop?

That sounds excellent to me, if possible (and I think close to what emijrp had in mind!) What would it take to produce such a feed? SJ PS - An aside: IA's policies include taking down any links on request, so this would not be a foolproof archive, but a 99% one. On Tue, Aug 24, 2010 at 9:13 PM, Samuel Klein <meta.sj(a)gmail.com> wrote:

...

I've asked Gordon Mohr @ IA about how to work with archive-it. I will cc: this thread on any response. SJ On Tue, Aug 24, 2010 at 8:56 PM, George Herbert <george.herbert(a)gmail.com> wrote:

On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj(a)gmail.com> wrote:

David Gerard wrote: > On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: > >> I want to make a proposal about external links preservation. Many times, >> when you check an external link or a link reference, the website is dead or >> offline. This websites are important, because they are the sources for the >> facts showed in the articles. Internet Archive searches for interesting >> websites to save in their hard disks, so, we can send them our external >> links sql tables (all projects and languages of course). They improve their >> database and we always have a copy of the sources text to check when needed. >> I think that this can be a cool partnership. >> > +1 > > Are people who clean up dead links taking the time to check Internet Archive to se if the page in question is there? Ec _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- Samuel Klein identi.ca:sj w:user:sj

emijrp

8:36 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Do we send a cc: to Wikimedia tech mailing list? 2010/8/28 Samuel Klein <meta.sj(a)gmail.com>

...

Gordon @ IA was most friendly and helpful. archive-it is a subscription service for focused collections of sites; he had a different idea better suited to our work. Gordon writes:

Now, given the importance of Wikipedia and editorial significant of

things

it outlinks-to, perhaps we could set up something specially focused on

its

content (and the de facto stream of newly-occurring outlinks), that would require no conscious effort by editors but greatly increase the odds that anything linked from Wikipedia would (a few months down the line) also be in our Archive. Is there (or could there be) a feed of all outlinks that

could crawl almost nonstop?

I've asked Gordon Mohr @ IA about how to work with archive-it. I will cc: this thread on any response. SJ On Tue, Aug 24, 2010 at 8:56 PM, George Herbert <george.herbert(a)gmail.com> wrote: > On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj(a)gmail.com>

wrote:

>> Here's the Archive's on-demand service: >> >> http://archive-it.org >> >> That would be the most reliable way to set up the partnership emijrp >> proposes. And it's certainly a good idea. Figuring out how to make >> it work for almost all editors and make it spam-proof may be >> interesting. >> >> SJ >> >> >> >> On Tue, Aug 24, 2010 at 8:45 PM, Ray Saintonge <saintonge(a)telus.net>

wrote:

>>> David Gerard wrote: >>>> On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: >>>> >>>>> I want to make a proposal about external links preservation. Many

times,

>>>>> when you check an external link or a link reference, the website is

dead or

>>>>> offline. This websites are important, because they are the sources

for the

>>>>> facts showed in the articles. Internet Archive searches for

interesting

>>>>> websites to save in their hard disks, so, we can send them our

external

>>>>> links sql tables (all projects and languages of course). They

improve their

>>>>> database and we always have a copy of the sources text to check when

needed.

>>>>> I think that this can be a cool partnership. >>>>> >>>> +1 >>>> >>>> >>> Are people who clean up dead links taking the time to check Internet >>> Archive to se if the page in question is there? >>> >>> >>> Ec >>> >>> _______________________________________________ >>> foundation-l mailing list >>> foundation-l(a)lists.wikimedia.org >>> Unsubscribe:

https://lists.wikimedia.org/mailman/listinfo/foundation-l

> -- Samuel Klein identi.ca:sj w:user:sj _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- Samuel Klein identi.ca:sj w:user:sj

Excirial

10:17 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

*What would it take to produce such a feed?** * A real-time feed may or may not be the best idea, for several reasons. - One issue is that every edit would have to be examined not only for external links, but for external links that were not present previously. Doing this real-time may cause slowdowns or additional load for the servers - keep in mind that we would have to scan external links on all edits for all Wikipedia's; Counted together this would result in a very, very busy feed towards IA. - Sometimes added links are spam or otherwise not acceptable, which means they may be removed soon after. In such a case man would prefer not having them archived, since it would be a waste of time and work for IA. An alternate solution could be forwarding a list of new links every day. The Database Layout<http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-data… Wikimedia seems to sugest that all external links are stored in a separate table in the database (And i presume this includes links in reference tags). I wonder if it would be possible to dump this entire table for IA, and afterwards send incremental change packages<http://en.wikipedia.org/wiki/Changeset>to them (Once a day perhaps?). That way they would always have a list of external links used by Wikipedia, and it would decrease the problem with performance hits, spam and links no longer used. If we only forwarded a feed with NEW links IA might end up with a long list of links which are removed over time. And above everything - the External Links table is simply a database table, which should be incredibly easy to read and process for IA, without custom coding required to read and store a feed. But perhaps the people at the tech mailing list have another \ better idea on how this should work :) ~Excirial On Sat, Aug 28, 2010 at 9:48 AM, Samuel Klein <meta.sj(a)gmail.com> wrote:

...

Gordon @ IA was most friendly and helpful. archive-it is a subscription service for focused collections of sites; he had a different idea better suited to our work. Gordon writes:

Now, given the importance of Wikipedia and editorial significant of

things

it outlinks-to, perhaps we could set up something specially focused on

its

could crawl almost nonstop?

wrote:

>>> David Gerard wrote: >>>> On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: >>>> >>>>> I want to make a proposal about external links preservation. Many

times,

>>>>> when you check an external link or a link reference, the website is

dead or

>>>>> offline. This websites are important, because they are the sources

for the

>>>>> facts showed in the articles. Internet Archive searches for

interesting

>>>>> websites to save in their hard disks, so, we can send them our

external

>>>>> links sql tables (all projects and languages of course). They

improve their

>>>>> database and we always have a copy of the sources text to check when

needed.

https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- Samuel Klein identi.ca:sj w:user:sj

theo10011

11 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

A real time feed wouldn't be a smart idea neither would only new links. New external links are probably the most reliable ones, if they dont work today then theres probably no point in preserving them. Link rot is the biggest problem here, external links which might be 5-6 years old or more. I suggested DeadURL.com because it re-directs to previous versions maintained by other archives after including *deadurl.com/ *in front of the dead link. Ideally, there should be a way to redirect to older versions of a page through an internal template to include before any dead links. I think that would be the easiest way to implement a change without any technical overhaul. Theo * * * * On Sun, Aug 29, 2010 at 3:47 AM, Excirial <wp.excirial(a)gmail.com> wrote:

...

for

Wikimedia seems to sugest that all external links are stored in a separate table in the database (And i presume this includes links in reference tags). I wonder if it would be possible to dump this entire table for IA, and afterwards send incremental change packages<http://en.wikipedia.org/wiki/Changeset>to them (Once a day perhaps?). That way they would always have a list of external links used by Wikipedia, and it would decrease the problem with performance hits, spam and links no longer used. If we only forwarded a feed with NEW links IA might end up with a long list of links which are removed over time. And above everything - the External Links table is simply a database table, which should be incredibly easy to read and process for IA, without custom coding required to read and store a feed. But perhaps the people at the tech mailing list have another \ better idea on how this should work :) ~Excirial On Sat, Aug 28, 2010 at 9:48 AM, Samuel Klein <meta.sj(a)gmail.com> wrote:

Gordon @ IA was most friendly and helpful. archive-it is a subscription service for focused collections of sites; he had a different idea better suited to our work. Gordon writes:

Now, given the importance of Wikipedia and editorial significant of

things

it outlinks-to, perhaps we could set up something specially focused on

its > content (and the de facto stream of newly-occurring outlinks), that

would

> require no conscious effort by editors but greatly increase the odds

that

> anything linked from Wikipedia would (a few months down the line) also

> in our Archive. Is there (or could there be) a feed of all outlinks

that

could crawl almost nonstop?

wrote:

>>> David Gerard wrote: >>>> On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: >>>> >>>>> I want to make a proposal about external links preservation. Many

times, >>>>>> when you check an external link or a link reference, the website

dead or

>>>>> offline. This websites are important, because they are the sources

for the

>>>>> facts showed in the articles. Internet Archive searches for

interesting

>>>>> websites to save in their hard disks, so, we can send them our

external

>>>>> links sql tables (all projects and languages of course). They

improve their >>>>>> database and we always have a copy of the sources text to check

when

needed.

https://lists.wikimedia.org/mailman/listinfo/foundation-l >>>> >>> >>> >>> >>> -- >>> Samuel Klein identi.ca:sj w:user:sj >>> >>> _______________________________________________ >>> foundation-l mailing list >>> foundation-l(a)lists.wikimedia.org >>> Unsubscribe:

https://lists.wikimedia.org/mailman/listinfo/foundation-l

>> >> >> I actually proposed some form of Wikimedia / IArchive link >> collaboration some years ago to a friend who worked there at the time; >> however, they left shortly afterwards. >> >> I like SJ's particular idea. Who has current contacts with Brewster >> Kahle or someone else over there? >> >> >> -- >> -george william herbert >> george.herbert(a)gmail.com >> >> _______________________________________________ >> foundation-l mailing list >> foundation-l(a)lists.wikimedia.org >> Unsubscribe:

https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- Samuel Klein identi.ca:sj w:user:sj

_______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Excirial

11:31 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

*A real time feed wouldn't be a smart idea neither would only new links. New external links are probably the most reliable ones. *After reading this part i am not entirely certain if you caught my drift correctly, or if you missed it somehow. In case you understood it correctly apologies for stating it again.* *What i intended to say was that we should forward a completely list of all external links to IA, in the form of the External Links database table. This means that all current links would be known to IA, and could therefor be checked. Because such a table would be large to transfer, we could opt to forward the changes to it once a day, which would result in a lot less data traffic having to be send. In other words - IA would have a complete list of all external links on Wikipedia, and that list would be updated once a day (Removing all links no longer used, while equally adding links added that day). * Ideally, there should be a way to redirect to older versions of a page* *through an internal template to include before any dead links. I think that would be the easiest way to implement a change without any technical overhaul. * Keep in mind that this partnership suggestion seems to focus on this: *Greatly increase the odds that anything linked from Wikipedia would also be in our Archive*. The feed towards them is simply a means to flag "Important" pages so that they are crawled more often, or at least crawled once they are reported which would increase their change of being saved in the archive. How we subsequently handle this stored data is another, but still different concern. Even so we do have several related templates already (Here<http://en.wikipedia.org/wiki/Template:Deadlink>)k>). ~Excirial* * On Sun, Aug 29, 2010 at 1:00 AM, theo10011 <de10011(a)gmail.com> wrote:

...

servers

- keep in mind that we would have to scan external links on all edits for all Wikipedia's; Counted together this would result in a very, very busy feed towards IA. - Sometimes added links are spam or otherwise not acceptable, which means they may be removed soon after. In such a case man would prefer not

having

them archived, since it would be a waste of time and work for IA. An alternate solution could be forwarding a list of new links every day. The Database Layout<

http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schem…

for

table

for IA, and afterwards send incremental change packages<http://en.wikipedia.org/wiki/Changeset>to them (Once a day perhaps?). That way they would always have a list of external links used by Wikipedia, and it would decrease the problem with performance hits, spam and links no longer used. If we only forwarded a feed with NEW links IA might end up with a long list of links which are

removed

over time. And above everything - the External Links table is simply a database table, which should be incredibly easy to read and process for

IA,

without custom coding required to read and store a feed. But perhaps the people at the tech mailing list have another \ better

idea

on how this should work :) ~Excirial On Sat, Aug 28, 2010 at 9:48 AM, Samuel Klein <meta.sj(a)gmail.com> wrote: > Gordon @ IA was most friendly and helpful. archive-it is a > subscription service for focused collections of sites; he had a > different idea better suited to our work. > > Gordon writes: > > Now, given the importance of Wikipedia and editorial significant of > things > > it outlinks-to, perhaps we could set up something specially focused

its > content (and the de facto stream of newly-occurring outlinks), that

would

> require no conscious effort by editors but greatly increase the odds

that > > anything linked from Wikipedia would (a few months down the line)

also

> in our Archive. Is there (or could there be) a feed of all outlinks

that > IA > > could crawl almost nonstop? > > That sounds excellent to me, if possible (and I think close to what > emijrp had in mind!) What would it take to produce such a feed? > > SJ > > PS - An aside: IA's policies include taking down any links on request, > so this would not be a foolproof archive, but a 99% one. > > > On Tue, Aug 24, 2010 at 9:13 PM, Samuel Klein <meta.sj(a)gmail.com>

wrote:

> > I've asked Gordon Mohr @ IA about how to work with archive-it. I

will

> > cc: this thread on any response. > > > > SJ > > > > On Tue, Aug 24, 2010 at 8:56 PM, George Herbert > > <george.herbert(a)gmail.com> wrote: > >> On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj(a)gmail.com> > wrote: > >>> Here's the Archive's on-demand service: > >>> > >>> http://archive-it.org > >>> > >>> That would be the most reliable way to set up the partnership

emijrp

> >>> proposes. And it's certainly a good idea. Figuring out how to

make

> >>> it work for almost all editors and make it spam-proof may be > >>> interesting. > >>> > >>> SJ > >>> > >>> > >>> > >>> On Tue, Aug 24, 2010 at 8:45 PM, Ray Saintonge <

saintonge(a)telus.net>

> wrote: > >>>> David Gerard wrote: > >>>>> On 24 August 2010 14:57, emijrp <emijrp(a)gmail.com> wrote: > >>>>> > >>>>>> I want to make a proposal about external links preservation.

Many

times, >>>>>> when you check an external link or a link reference, the website

is > dead or > >>>>>> offline. This websites are important, because they are the

sources

for the

>>>>> facts showed in the articles. Internet Archive searches for

interesting

>>>>> websites to save in their hard disks, so, we can send them our

external

>>>>> links sql tables (all projects and languages of course). They

improve their >>>>>> database and we always have a copy of the sources text to check

when > needed. > >>>>>> I think that this can be a cool partnership. > >>>>>> > >>>>> +1 > >>>>> > >>>>> > >>>> Are people who clean up dead links taking the time to check

Internet

>>> Archive to se if the page in question is there? >>> >>> >>> Ec >>> >>> _______________________________________________ >>> foundation-l mailing list >>> foundation-l(a)lists.wikimedia.org >>> Unsubscribe:

https://lists.wikimedia.org/mailman/listinfo/foundation-l > >> > >> > >> I actually proposed some form of Wikimedia / IArchive link > >> collaboration some years ago to a friend who worked there at the

time;

>> however, they left shortly afterwards. >> >> I like SJ's particular idea. Who has current contacts with Brewster >> Kahle or someone else over there? >> >> >> -- >> -george william herbert >> george.herbert(a)gmail.com >> >> _______________________________________________ >> foundation-l mailing list >> foundation-l(a)lists.wikimedia.org >> Unsubscribe:

https://lists.wikimedia.org/mailman/listinfo/foundation-l

> > -- Samuel Klein identi.ca:sj w:user:sj

_______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Ryan Kaldari

31 Aug 31 Aug

9:31 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

A real-time feed of external links is overkill. As mentioned by others, the chief problem is linkrot of old links. All we need to do is dump the contents of externallinks.el_to from the database once a year or so, run a hex to ASCII conversion on it, zip it, and email it to someone at the Internet Archive. Anyone with access to the databases should be able to do this fairly easily. Rather than trying to engineer a complicated system that will take a year to implement, why not take this simple approach that will take care of 90+% of the problem? Ryan Kaldari

Rodan Bury

9:49 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

A dump "once a year or so" is not enough: the average life span of a website is 3 months. Kind regards, Dodoïste 2010/8/31 Ryan Kaldari <rkaldari(a)wikimedia.org>

...

Chad

9:59 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

On Tue, Aug 31, 2010 at 5:31 PM, Ryan Kaldari <rkaldari(a)wikimedia.org> wrote:

...

Why once a year? We already get a successful externallinks dump every dump cycle. Even the enwiki one is only half a month old[0]. If someone wants to work with Internet Archive or anyone else on this, the data is already there. -Chad [0] http://dumps.wikimedia.org/enwiki/20100817/

Henning Schlottmann

25 Aug 25 Aug

8:40 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

On 25.08.2010 02:45, Ray Saintonge wrote:

...

Are people who clean up dead links taking the time to check Internet Archive to se if the page in question is there?

No one should even touch a presumed dead link unless he or she has the expertise to check the link for a simple restructuring on the linked site and do a search in the internet archive. Personally I consider removing "dead links" without such a check as a very dangerous form of vandalism. Ciao Henning

Nikola Smolenski

24 Aug 24 Aug

2:19 p.m.

On 08/24/2010 03:57 PM, emijrp wrote:

...

I wanted to suggest this for a long time. I see two more reasons for this: - We are often copying free images or text from various sites (for example flickr but other ones too). It happens that these sites go offline or change their licenses later. Having such an archive, archived by an independent organization, would be indisputable proof of copyright status. - Wikipedia often writes articles about current events, and these link to various news organizations as sources. It happens sometimes that these sources stealthily change their content for various reasons. Such an archive, if it would be able to quickly follow Wikipedia's new links, would be a strong deterrent against this Orwellian trend.

wiki-list＠phizz.demon.co.uk

7:05 p.m.

Nikola Smolenski wrote:

...

Personally I wouldn't rely on a flickr CC license as being in any way reliable. http://commons.wikimedia.org/wiki/Commons:Flickr_washing I've seen too many AP photographs cropped to remove the AP attribute and uploaded to flickr as CC-BY to accept a flickr CC license at face value. In most cases the person doing so is probably taking stuff already cropped, and probably believes that if it is on the internet its public domain. No university, publisher, or newspaper has used my CC licensed images either commercially or non-commercially without checking with me first that the work is actually CC licensed. They have always carried out some form of due diligence to ascertain that the image is either licensed properly, and that they get a specific license to reuse. IOW they obtain a 'paper trail' of permission.

...

- Wikipedia often writes articles about current events, and these link to various news organizations as sources. It happens sometimes that these sources stealthily change their content for various reasons. Such an archive, if it would be able to quickly follow Wikipedia's new links, would be a strong deterrent against this Orwellian trend.

If someone is making copies of web pages that is a copyright violation. Unless they have, in the US, specific exemption from the US Copyright Office, that can lead to some heavy legal issues. The internet archive happens to have limited permissions on obsolete games and software, but otherwise it respects copyright and robots.txt and applies the directives retroactively. http://www.archive.org/about/terms.php http://en.wikipedia.org/wiki/Internet_Archive#Healthcare_Advocates.2C_Inc. http://web.archive.org/web/20020923133856rn_1/www.nytimes.com/auth/login?UR…

Rodan Bury

7:23 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

2010/8/24 <wiki-list(a)phizz.demon.co.uk>

...

Nikola Smolenski wrote:

I wanted to suggest this for a long time. I see two more reasons for

this: > > - We are often copying free images or text from various sites (for > example flickr but other ones too). It happens that these sites go > offline or change their licenses later. Having such an archive, archived > by an independent organization, would be indisputable proof of copyright > status.

...

The internet archive [...] respects copyright and robots.txt and applies the directives retroactively.

And so does Wikiwix. Wikiwix do respect robots.txt. Linterweb's advocate certified there is no legal issue with Wikiwix. Regards, Dodoïste

Nikola Smolenski

8:19 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Дана Tuesday 24 August 2010 21:05:05 wiki-list(a)phizz.demon.co.uk написа:

...

Nikola Smolenski wrote:

That is another issue entirely. And in order to determine if an image has been washed in such a way and who did it you have to know its origin.

...

If the original website is gone, they can't even call to check.

...

It appears that so far this has not been a problem in practice, and anyway if they are willing to take the risk, who are we to stop them?

Robert Rohde

8:27 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

On Tue, Aug 24, 2010 at 12:05 PM, <wiki-list(a)phizz.demon.co.uk> wrote: <snip>

...

No university, publisher, or newspaper has used my CC licensed images either commercially or non-commercially without checking with me first that the work is actually CC licensed. They have always carried out some form of due diligence to ascertain that the image is either licensed properly, and that they get a specific license to reuse. IOW they obtain a 'paper trail' of permission.

Good for you. Most professional publishers do make every effort to carry out due diligence. However, I have had a several cases where well known newspapers and magazines appropriated my images without attempting to contact me (and in some cases even without providing any attribution). It does leave me to wonder how many other times my images might have been used professionally and/or improperly and I just don't know about it. -Robert Rohde

teun spaans

2:47 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

This looks like a solid organization. Solid in the sense that it wont go suddenly offline. Such links may be valuable for: - article references to sources, in case the source goes offline - article references to sources, in case thesource changes its content - media copies when the source changes or removes a license I took a look at the example in the french wiki, and didnt spot a date in the archive reference. If the source changes its content, this may pose a problem. kind regards, teun On Tue, Aug 24, 2010 at 3:57 PM, emijrp <emijrp(a)gmail.com> wrote:

...

Pascal Martin

3:06 p.m.

Dear teun, Linterweb is working around the wikipedia before 2007 : http://en.wikipedia.org/wiki/Wikipedia So i am sure we will be online in the futur :) Looks like: http://wikiwix.com http://okawix.com Not a problem, if the source changes there is no change in our cache, it is what want the french community. Now we are working in feature , to add a parameter in the link wich needs frequently update. Sincerely Pascal ----- Original Message ----- From: "teun spaans" <teun.spaans(a)gmail.com> To: "Wikimedia Foundation Mailing List" <foundation-l(a)lists.wikimedia.org> Sent: Tuesday, August 24, 2010 4:47 PM Subject: Re: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

...

_______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Bence Damokos

3:27 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Hi, On Tue, Aug 24, 2010 at 4:47 PM, teun spaans <teun.spaans(a)gmail.com> wrote:

...

I took a look at the example in the french wiki, and didnt spot a date in the archive reference. If the source changes its content, this may pose a problem. If you click on an archive link the top frame will display the exact date

of the archiving - I think the reason it is not displayed by default on the French Wikipedia is because the archive links are generated by JavaScript on the fly. (At least that was the case the last time I looked at the French Wikipedia.) Having the ability to store multiple copies of the same webpage (for different dates) was one of the first feature requests we had at the Hungarian Wikipedia and it seems they are working on it. Still, Wikiwix's service is very convenient and hassle free for all the static websites or references. Webcitation.org also has a service for on-demand archiving and they do store multiple versions of the same page. Unfortunately their service is often intermittent and their website tends to go dark, but otherwise it is a convenient service for manual archiving. (I had a bot once that sent each link through its service on the Hungarian Wikipedia, and for a time the English Wikipedia had a similar bot. Best regards, Bence

emijrp

4:32 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

We need a trusted and reliable archiving project (and perhaps mirrors). Webcite or Wikiwix can be great projects, but, how long will they be in the web? Until 2012? 2015? 2020? The average life of a website is 77 days[1], and we see dead links everywhere in Wikipedia articles. This is a big problem, and not only for Wikipedia, Internet builds and destroyes information too fast. Internet Archive is a nonprofit foundation, and it is running since 1996, so I think that it is a stable project and they are going to create mirrors in more countries (now there is a mirror in Alexandria). But, of course, Webcite or Wikiwix can help storing web copies (3 different archiving projects are better than only 1). Regards, emijrp [1] http://www.archive.org/about/faqs.php 2010/8/24 Bence Damokos <bdamokos(a)gmail.com>

...

Hi, On Tue, Aug 24, 2010 at 4:47 PM, teun spaans <teun.spaans(a)gmail.com> wrote:

David Gerard

4:37 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

On 24 August 2010 17:32, emijrp <emijrp(a)gmail.com> wrote:

...

Internet Archive is a nonprofit foundation, and it is running since 1996, so I think that it is a stable project and they are going to create mirrors in more countries (now there is a mirror in Alexandria). But, of course, Webcite or Wikiwix can help storing web copies (3 different archiving projects are better than only 1).

That's a key point: have multiple archives easily supported. (Hopefully not as complicated as what happens when you click on an ISBN.) - d.

theo10011

5:10 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Its a great idea, using the wayback machine to ward of link rot. I support it but doesn't Google cache offer a similar service. there is also deadURL.com which uses Google Cache, the Internet Archive, and user submissions for gathering dead links. I would guess that Google Cache would have the highest and the longest reliability, at least as long as Google exists, its their business. Regards Theo On Tue, Aug 24, 2010 at 10:07 PM, David Gerard <dgerard(a)gmail.com> wrote:

...

On 24 August 2010 17:32, emijrp <emijrp(a)gmail.com> wrote:

Internet Archive is a nonprofit foundation, and it is running since 1996,

I think that it is a stable project and they are going to create mirrors

more countries (now there is a mirror in Alexandria). But, of course, Webcite or Wikiwix can help storing web copies (3 different archiving projects are better than only 1).

That's a key point: have multiple archives easily supported. (Hopefully not as complicated as what happens when you click on an ISBN.) - d. _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

Rodan Bury

5:27 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Google's cache only last a handful of days. It's very useful if the Website is down temporarily (for maintenance or overload for example) but totally useless otherwise. Kind Regards, Dodoïste 2010/8/24 theo10011 <de10011(a)gmail.com>

...

Ray Saintonge

25 Aug 25 Aug

12:58 a.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Rodan Bury wrote:

...

Google's cache only last a handful of days. It's very useful if the Website is down temporarily (for maintenance or overload for example) but totally useless otherwise.

Even if Google Cache were less ephemeral, it's hard to avoid the soupçon of evil that permeates Google. Ec

...

2010/8/24 theo10011 <de10011(a)gmail.com> > Its a great idea, using the wayback machine to ward of link rot. I support > it but doesn't Google cache offer a similar service. there is also > deadURL.com which uses Google Cache, the Internet Archive, and user > submissions for gathering dead links. > > I would guess that Google Cache would have the highest and the longest > reliability, at least as long as Google exists, its their business. >

Robert Rohde

24 Aug 24 Aug

7:50 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Does anyone know what the status of the Internet Archive is with respect to being a practical ongoing concern? In the last couple years IA has added relatively little web-based content. For example, their Wayback Machine currently offers: www.nytimes.com: 11 pages since 2006 en.wikipedia.org: 5 pages since 2008 www.nasa.gov: 12 pages since 2008 scienceblogs.com: 0 pages since 2008 It gives the impression that they are so ineffective at archiving recent content as to be effectively irrelevant. They do have a warning that it can take 6 or more months for newly accessed content to be incorporated into their database, but at this point the delay has been significantly more than that. Even at their peak they rarely archived more than a few hundred pages per major domain per year, which still amounts to a tiny fraction of the internet. The idea of seeking collaborations with people that archive web content is a good one, but I don't know that IA is really in a position to be all that useful. -Robert Rohde On Tue, Aug 24, 2010 at 6:57 AM, emijrp <emijrp(a)gmail.com> wrote:

...

Excirial

8:13 p.m.

New subject: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

* It gives the impression that they are so ineffective at archiving recent content as to be effectively irrelevant.* Your not the only person asking that question, have a look at this FAQ entry<http://www.archive.org/about/faqs.php#103>and This forum post<http://www.archive.org/post/320741/large-site-with-no-entries-at-al…10>. To specifically quote the FAQ: ** * It generally takes 6 months or more (up to 24 months) for pages to appear in the Wayback Machine after they are collected, because of delays in transferring material to long-term storage and indexing, or the requirements of our collection partners. * *In some cases, crawled content from certain projects can appear in a much shorter timeframe — as little as a few weeks from when it was crawled. Older material for the same pages and sites may still appear separately, months later. * *There is no access to files before they appear in the Wayback Machine. * * Even at their peak they rarely archived more than a few hundred pages per major domain per year, which still amounts to a tiny fraction of the internet* Keep in mind that sub-pages are indexed separately. For example the Administrators noticeboard <http://web.archive.org/web/*/http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard>and blocking policy<http://web.archive.org/web/*/http://en.wikipedia.org/wiki/Wikiped… indexed at least several times a year. Equally keep in mind that the reliable sources we use rarely change content on a later date. A news article published in a news paper is static, and most news article's posted are equally static (With one or two updates before being moved from the main page). As of such we don't need a high interval for updates - a single back link is often more then sufficient for referencing purposes, since we aren't keeping a revision history for sources. ~Excirial On Tue, Aug 24, 2010 at 9:50 PM, Robert Rohde <rarohde(a)gmail.com> wrote:

...

Hi all; I want to make a proposal about external links preservation. Many times, when you check an external link or a link reference, the website is dead

offline. This websites are important, because they are the sources for

the

facts showed in the articles. Internet Archive searches for interesting websites to save in their hard disks, so, we can send them our external links sql tables (all projects and languages of course). They improve

their

database and we always have a copy of the sources text to check when

needed.

I think that this can be a cool partnership. Regards, emijrp _______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

_______________________________________________ foundation-l mailing list foundation-l(a)lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

4980

days inactive

4987

days old

wikimedia-l@lists.wikimedia.org

Manage subscription

31 comments

17 participants

tags (0)

participants (17)

Bence Damokos
Chad
David Gerard
emijrp
Excirial
George Herbert
Henning Schlottmann
Nikola Smolenski
Pascal Martin
Ray Saintonge
Robert Rohde
Rodan Bury
Ryan Kaldari
Samuel Klein
teun spaans
theo10011
wiki-list＠phizz.demon.co.uk