Hi all;
I want to make a proposal about external link preservation. Many times, when you check an external link or a reference link, the website is dead or offline. These websites are important, because they are the sources for the facts stated in the articles. The Internet Archive looks for interesting websites to save on its disks, so we could send them our external links SQL tables (for all projects and languages, of course). They would improve their database, and we would always have a copy of the source text to check when needed.
I think that this can be a cool partnership.
Regards, emijrp
On 24 August 2010 14:57, emijrp emijrp@gmail.com wrote:
+1
- d.
Dear all,
The French Wikipedia (for two years now) and the Hungarian Wikipedia are using our archive, and it will be implemented in other French projects.
You can see it on any French article, for example: http://fr.wikipedia.org/wiki/Maupassant#Notes_et_r.C3.A9f.C3.A9rences
Just click on "archive".
When a link is created, we store its content in real time.
If you need more information you can contact us on freenode in the channel #linterweb.
Sincerely Pascal
David Gerard wrote:
+1
Are people who clean up dead links taking the time to check the Internet Archive to see if the page in question is there?
Ec
Here's the Archive's on-demand service:
That would be the most reliable way to set up the partnership emijrp proposes. And it's certainly a good idea. Figuring out how to make it work for almost all editors and make it spam-proof may be interesting.
SJ
I've asked Gordon Mohr @ IA about how to work with archive-it. I will cc: this thread on any response.
SJ
On Tue, Aug 24, 2010 at 8:56 PM, George Herbert george.herbert@gmail.com wrote:
I actually proposed some form of Wikimedia / IArchive link collaboration some years ago to a friend who worked there at the time; however, they left shortly afterwards.
I like SJ's particular idea. Who has current contacts with Brewster Kahle or someone else over there?
Thanks SJ.
Gordon @ IA was most friendly and helpful. archive-it is a subscription service for focused collections of sites; he had a different idea better suited to our work.
Gordon writes:
Now, given the importance of Wikipedia and the editorial significance of the things it outlinks to, perhaps we could set up something specially focused on its content (and the de facto stream of newly-occurring outlinks), that would require no conscious effort by editors but greatly increase the odds that anything linked from Wikipedia would (a few months down the line) also be in our Archive. Is there (or could there be) a feed of all outlinks that IA could crawl almost nonstop?
That sounds excellent to me, if possible (and I think it is close to what emijrp had in mind!). What would it take to produce such a feed?
SJ
PS - An aside: IA's policies include taking down any links on request, so this would not be a foolproof archive, but a 99% one.
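A minimal sketch of what such a feed might look like, polling a single wiki's API for recent changes and listing the external links on the touched pages. It uses standard MediaWiki API modules (list=recentchanges, prop=extlinks), but treat the details as illustrative; a production feed would more likely be generated server-side from the externallinks table.

import json
import urllib.parse
import urllib.request

# One wiki's API is used here as an example; a real feed would cover every project.
API = "https://en.wikipedia.org/w/api.php"

def api_get(params):
    query = urllib.parse.urlencode(dict(params, format="json"))
    with urllib.request.urlopen(API + "?" + query, timeout=30) as resp:
        return json.load(resp)

def recent_outlinks(limit=50):
    """Yield external links found on recently edited or created articles."""
    rc = api_get({"action": "query", "list": "recentchanges",
                  "rcnamespace": 0, "rctype": "edit|new", "rclimit": limit})
    titles = {change["title"] for change in rc["query"]["recentchanges"]}
    for title in titles:
        el = api_get({"action": "query", "prop": "extlinks",
                      "titles": title, "ellimit": "max"})
        for page in el["query"]["pages"].values():
            for link in page.get("extlinks", []):
                yield link.get("*") or link.get("url")

for url in recent_outlinks():
    print(url)

IA could consume the resulting stream of URLs and queue them for crawling almost continuously, which is roughly what Gordon describes.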
Should we send a cc: to the Wikimedia tech mailing list?
"What would it take to produce such a feed?"
A real-time feed may or may not be the best idea, for several reasons:
- Every edit would have to be examined not only for external links, but for external links that were not present previously. Doing this in real time may cause slowdowns or additional load for the servers; keep in mind that we would have to scan the external links on all edits for all Wikipedias, which taken together would result in a very, very busy feed towards IA.
- Sometimes added links are spam or otherwise not acceptable, which means they may be removed soon after. In such a case one would prefer not having them archived, since it would be a waste of time and work for IA.
An alternative solution could be forwarding a list of new links every day. The database layout for MediaWiki (http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema.png) seems to suggest that all external links are stored in a separate table in the database (and I presume this includes links in reference tags). I wonder if it would be possible to dump this entire table for IA, and afterwards send them incremental change packages (http://en.wikipedia.org/wiki/Changeset), once a day perhaps. That way they would always have a list of the external links used by Wikipedia, and it would reduce the problems with performance hits, spam and links that are no longer used. If we only forwarded a feed of NEW links, IA might end up with a long list of links which are removed over time. And above everything, the external links table is simply a database table, which should be very easy for IA to read and process, without custom coding required to read and store a feed.
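A minimal sketch of that incremental approach, assuming two successive dumps of the externallinks table are available as plain text files with one URL per line (the file names are illustrative):

# Compare yesterday's and today's dumps of the externallinks table and
# write out only the difference, which is what would be handed to IA.

def load_links(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return {line.strip() for line in f if line.strip()}

old_links = load_links("externallinks-2010-08-28.txt")
new_links = load_links("externallinks-2010-08-29.txt")

added = new_links - old_links      # links IA has not been told about yet
removed = old_links - new_links    # links no longer used on the wiki

with open("changeset-added.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(added)))
with open("changeset-removed.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(removed)))

print(f"{len(added)} links added, {len(removed)} links removed since yesterday")

Only the two small change files would then need to be transferred each day, rather than the full table.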
But perhaps the people on the tech mailing list have another / better idea on how this should work :)
~Excirial
A real-time feed wouldn't be a smart idea, and neither would only new links. New external links are probably the most reliable ones; if they don't work today then there's probably no point in preserving them. Link rot is the biggest problem here: external links which might be 5-6 years old or more. I suggested DeadURL.com because it redirects to previous versions maintained by other archives after you include deadurl.com/ in front of the dead link.
Ideally, there should be a way to redirect to older versions of a page through an internal template to include before any dead links. I think that would be the easiest way to implement a change without any technical overhaul.
Theo
"A real time feed wouldn't be a smart idea, neither would only new links. New external links are probably the most reliable ones."
After reading this part I am not entirely certain whether you caught my drift correctly, or missed it somehow; in case you understood it correctly, apologies for stating it again. What I intended to say was that we should forward a complete list of all external links to IA, in the form of the external links database table. This means that all current links would be known to IA and could therefore be checked. Because such a table would be large to transfer, we could opt to forward only the changes to it once a day, which would result in a lot less data traffic having to be sent. In other words, IA would have a complete list of all external links on Wikipedia, and that list would be updated once a day (removing all links no longer used, while equally adding the links added that day).
"Ideally, there should be a way to redirect to older versions of a page through an internal template to include before any dead links. I think that would be the easiest way to implement a change without any technical overhaul."
Keep in mind that this partnership suggestion seems to focus on this: "greatly increase the odds that anything linked from Wikipedia would also be in our Archive". The feed towards them is simply a means to flag "important" pages so that they are crawled more often, or at least crawled once they are reported, which would increase their chance of being saved in the archive. How we subsequently handle that stored data is a related but separate concern. Even so, we do have several related templates already (see http://en.wikipedia.org/wiki/Template:Deadlink).
~Excirial
A real-time feed of external links is overkill. As mentioned by others, the chief problem is linkrot of old links. All we need to do is dump the contents of externallinks.el_to from the database once a year or so, run a hex to ASCII conversion on it, zip it, and email it to someone at the Internet Archive. Anyone with access to the databases should be able to do this fairly easily. Rather than trying to engineer a complicated system that will take a year to implement, why not take this simple approach that will take care of 90+% of the problem?
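A minimal sketch of what that could look like, assuming el_to has already been dumped one hex-encoded value per line to a text file (both file names are illustrative):

import gzip

# Decode hex-encoded el_to values and write them, gzipped, for hand-off to IA.
with open("el_to.hex", encoding="ascii") as src, \
        gzip.open("external-links.txt.gz", "wt", encoding="utf-8") as dst:
    for line in src:
        value = line.strip()
        if value.startswith("0x"):   # strip a possible 0x prefix from hex-formatted dumps
            value = value[2:]
        if not value:
            continue
        # hex -> text; replace any bytes that are not valid UTF-8 rather than failing
        dst.write(bytes.fromhex(value).decode("utf-8", errors="replace") + "\n")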
Ryan Kaldari
A dump "once a year or so" is not enough: the average life span of a website is 3 months. Kind regards, Dodoïste
Why once a year? We already get a successful externallinks dump every dump cycle. Even the enwiki one is only half a month old[0]. If someone wants to work with Internet Archive or anyone else on this, the data is already there.
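A minimal sketch of reusing that published dump; the URL pattern below is an assumption about the dumps site layout and should be checked against the actual dump index:

import gzip
import re
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-externallinks.sql.gz"
LOCAL = "enwiki-latest-externallinks.sql.gz"

urllib.request.urlretrieve(DUMP_URL, LOCAL)

# The dump is a series of SQL INSERT statements; crudely pull the URLs out of it
# so they can be handed to the Internet Archive as a plain list.
url_pattern = re.compile(rb"'(https?://[^']+)'")
with gzip.open(LOCAL, "rb") as dump, open("enwiki-external-links.txt", "wb") as out:
    for line in dump:
        for match in url_pattern.finditer(line):
            out.write(match.group(1) + b"\n")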
-Chad
On 25.08.2010 02:45, Ray Saintonge wrote:
Are people who clean up dead links taking the time to check the Internet Archive to see if the page in question is there?
No one should even touch a presumed dead link unless he or she has the expertise to check the link for a simple restructuring of the linked site and to do a search in the Internet Archive. Personally I consider removing "dead links" without such a check to be a very dangerous form of vandalism.
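For what it's worth, a minimal sketch of such a check, using the Wayback Machine's availability endpoint (the JSON API at https://archive.org/wayback/available is assumed here and may postdate this discussion):

import json
import urllib.parse
import urllib.request

def wayback_snapshot(url):
    """Return the closest archived snapshot URL for `url`, or None if IA has no copy."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api, timeout=30) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# Example: check a presumed-dead reference before removing it from an article.
print(wayback_snapshot("http://www.example.com/some-dead-source.html"))

A bot or gadget doing this check could then offer the archived copy instead of simply deleting the reference.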
Ciao Henning
On 08/24/2010 03:57 PM, emijrp wrote:
I want to make a proposal about external link preservation. Many times, when you check an external link or a reference link, the website is dead or offline. These websites are important, because they are the sources for the facts stated in the articles. The Internet Archive looks for interesting websites to save on its disks, so we could send them our external links SQL tables (for all projects and languages, of course). They would improve their database, and we would always have a copy of the source text to check when needed.
I have wanted to suggest this for a long time. I see two more reasons for it:
- We are often copying free images or text from various sites (for example Flickr, but others too). It happens that these sites go offline or change their licenses later. Having such an archive, maintained by an independent organization, would be indisputable proof of copyright status.
- Wikipedia often writes articles about current events, and these link to various news organizations as sources. It happens sometimes that these sources stealthily change their content for various reasons. Such an archive, if it were able to quickly follow Wikipedia's new links, would be a strong deterrent against this Orwellian trend.
Nikola Smolenski wrote:
I have wanted to suggest this for a long time. I see two more reasons for it:
- We are often copying free images or text from various sites (for example Flickr, but others too). It happens that these sites go offline or change their licenses later. Having such an archive, maintained by an independent organization, would be indisputable proof of copyright status.
Personally I wouldn't treat a Flickr CC license as being in any way reliable. http://commons.wikimedia.org/wiki/Commons:Flickr_washing
I've seen too many AP photographs cropped to remove the AP attribution and uploaded to Flickr as CC-BY to accept a Flickr CC license at face value. In most cases the person doing so is probably taking stuff that is already cropped, and probably believes that if it is on the internet it's public domain.
No university, publisher, or newspaper has used my CC licensed images either commercially or non-commercially without checking with me first that the work is actually CC licensed. They have always carried out some form of due diligence to ascertain that the image is licensed properly, and to get a specific license to reuse it. IOW they obtain a 'paper trail' of permission.
- Wikipedia often writes articles about current events, and these link to various news organizations as sources. It happens sometimes that these sources stealthily change their content for various reasons. Such an archive, if it were able to quickly follow Wikipedia's new links, would be a strong deterrent against this Orwellian trend.
If someone is making copies of web pages, that is a copyright violation. Unless they have a specific exemption from the US Copyright Office, that can lead to some heavy legal issues in the US. The Internet Archive happens to have limited permissions for obsolete games and software, but otherwise it respects copyright and robots.txt and applies the directives retroactively.
http://www.archive.org/about/terms.php http://en.wikipedia.org/wiki/Internet_Archive#Healthcare_Advocates.2C_Inc. http://web.archive.org/web/20020923133856rn_1/www.nytimes.com/auth/login?URI...
2010/8/24 wiki-list@phizz.demon.co.uk
The Internet Archive [...] respects copyright and robots.txt and applies the directives retroactively.
And so does Wikiwix. Wikiwix does respect robots.txt. Linterweb's lawyer has certified that there is no legal issue with Wikiwix.
Regards, Dodoïste
On Tuesday 24 August 2010 21:05:05, wiki-list@phizz.demon.co.uk wrote:
Personally I wouldn't treat a Flickr CC license as being in any way reliable. http://commons.wikimedia.org/wiki/Commons:Flickr_washing
I've seen too many AP photographs cropped to remove the AP attribution and uploaded to Flickr as CC-BY to accept a Flickr CC license at face value. In most cases the person doing so is probably taking stuff that is already cropped, and probably believes that if it is on the internet it's public domain.
That is another issue entirely. And in order to determine if an image has been washed in such a way and who did it you have to know its origin.
No university, publisher, or newspaper has used my CC licensed images either commercially or non-commercially without checking with me first that the work is actually CC licensed. They have always carried out some
If the original website is gone, they can't even call to check.
If someone is making copies of web pages, that is a copyright violation. Unless they have a specific exemption from the US Copyright Office, that can lead to some heavy legal issues in the US. The Internet Archive
It appears that so far this has not been a problem in practice, and anyway if they are willing to take the risk, who are we to stop them?
On Tue, Aug 24, 2010 at 12:05 PM, wiki-list@phizz.demon.co.uk wrote: <snip>
No university, publisher, or newspaper has used my CC licensed images either commercially or non-commercially without checking with me first that the work is actually CC licensed. They have always carried out some form of due diligence to ascertain that the image is licensed properly, and to get a specific license to reuse it. IOW they obtain a 'paper trail' of permission.
Good for you. Most professional publishers do make every effort to carry out due diligence. However, I have had several cases where well-known newspapers and magazines appropriated my images without attempting to contact me (and in some cases even without providing any attribution). It does make me wonder how many other times my images might have been used professionally and/or improperly and I just don't know about it.
-Robert Rohde
This looks like a solid organization; solid in the sense that it won't suddenly go offline.
Such links may be valuable for:
- article references to sources, in case the source goes offline
- article references to sources, in case the source changes its content
- media copies, in case the source changes or removes a license
I took a look at the example on the French wiki, and didn't spot a date in the archive reference. If the source changes its content, this may pose a problem.
kind regards, teun
Dear teun,
Linterweb has been working around Wikipedia since before 2007: http://en.wikipedia.org/wiki/Wikipedia So I am sure we will still be online in the future :)
Looks like: http://wikiwix.com http://okawix.com
Not a problem: if the source changes, there is no change in our cache, which is what the French community wants.
We are now working on a feature to add a parameter for links which need frequent updates.
Sincerely Pascal
----- Original Message ----- From: "teun spaans" teun.spaans@gmail.com To: "Wikimedia Foundation Mailing List" foundation-l@lists.wikimedia.org Sent: Tuesday, August 24, 2010 4:47 PM Subject: Re: [Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive
This looks like a solid organization. Solid in the sense that it wont go suddenly offline.
Such links may be valuable for:
- article references to sources, in case the source goes offline
- article references to sources, in case thesource changes its content
- media copies when the source changes or removes a license
I took a look at the example in the french wiki, and didnt spot a date in the archive reference. If the source changes its content, this may pose a problem.
kind regards, teun
On Tue, Aug 24, 2010 at 3:57 PM, emijrp emijrp@gmail.com wrote:
Hi all;
I want to make a proposal about external links preservation. Many times, when you check an external link or a link reference, the website is dead or offline. This websites are important, because they are the sources for the facts showed in the articles. Internet Archive searches for interesting websites to save in their hard disks, so, we can send them our external links sql tables (all projects and languages of course). They improve their database and we always have a copy of the sources text to check when needed.
I think that this can be a cool partnership.
Regards, emijrp _______________________________________________ foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Hi,
On Tue, Aug 24, 2010 at 4:47 PM, teun spaans teun.spaans@gmail.com wrote:
I took a look at the example on the French wiki, and didn't spot a date in the archive reference. If the source changes its content, this may pose a problem.
If you click on an archive link, the top frame will display the exact date of the archiving. I think the reason it is not displayed by default on the French Wikipedia is that the archive links are generated by JavaScript on the fly. (At least that was the case the last time I looked at the French Wikipedia.)
Having the ability to store multiple copies of the same webpage (for different dates) was one of the first feature requests we had at the Hungarian Wikipedia and it seems they are working on it. Still, Wikiwix's service is very convenient and hassle free for all the static websites or references.
Webcitation.org also has a service for on-demand archiving and they do store multiple versions of the same page. Unfortunately their service is often intermittent and their website tends to go dark, but otherwise it is a convenient service for manual archiving. (I had a bot once that sent each link through its service on the Hungarian Wikipedia, and for a time the English Wikipedia had a similar bot.)
Best regards, Bence
We need a trusted and reliable archiving project (and perhaps mirrors). WebCite or Wikiwix may be great projects, but how long will they be on the web? Until 2012? 2015? 2020?
The average life of a website is 77 days[1], and we see dead links everywhere in Wikipedia articles. This is a big problem, and not only for Wikipedia; the Internet builds and destroys information too fast.
The Internet Archive is a nonprofit foundation and has been running since 1996, so I think it is a stable project, and they are going to create mirrors in more countries (there is already a mirror in Alexandria). But, of course, WebCite or Wikiwix can help store web copies (3 different archiving projects are better than only 1).
Regards, emijrp
[1] http://www.archive.org/about/faqs.php
On 24 August 2010 17:32, emijrp emijrp@gmail.com wrote:
The Internet Archive is a nonprofit foundation and has been running since 1996, so I think it is a stable project, and they are going to create mirrors in more countries (there is already a mirror in Alexandria). But, of course, WebCite or Wikiwix can help store web copies (3 different archiving projects are better than only 1).
That's a key point: have multiple archives easily supported. (Hopefully not as complicated as what happens when you click on an ISBN.)
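A rough sketch of what "multiple archives" could look like from an editor's or bot's point of view: build candidate archive lookup URLs for a dead link and take whichever responds. The URL patterns below are assumptions about each service's interface, not documented APIs, and would need to be verified against each archive.

import urllib.parse
import urllib.request

# Assumed lookup patterns for each archive; adjust to whatever each service exposes.
ARCHIVE_PATTERNS = [
    "http://web.archive.org/web/*/{url}",              # Internet Archive Wayback Machine
    "http://www.webcitation.org/query?url={quoted}",   # WebCite
    "http://archive.wikiwix.com/cache/?url={quoted}",  # Wikiwix
]

def candidate_archives(url):
    """Return archive lookup URLs to try, in order, for a (possibly dead) link."""
    quoted = urllib.parse.quote(url, safe="")
    return [pattern.format(url=url, quoted=quoted) for pattern in ARCHIVE_PATTERNS]

def first_responding(url):
    """Return the first archive lookup that answers with HTTP 200, else None."""
    for candidate in candidate_archives(url):
        try:
            req = urllib.request.Request(candidate, method="HEAD")
            with urllib.request.urlopen(req, timeout=15) as resp:
                if resp.status == 200:
                    return candidate
        except Exception:
            continue
    return None

print(first_responding("http://www.example.com/some-dead-source.html"))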
- d.
It's a great idea, using the Wayback Machine to ward off link rot. I support it, but doesn't Google cache offer a similar service? There is also deadURL.com, which uses Google Cache, the Internet Archive, and user submissions for gathering dead links.
I would guess that Google Cache would have the highest and longest reliability, at least as long as Google exists; it's their business.
Regards
Theo
Google's cache only lasts a handful of days. It's very useful if the website is down temporarily (for maintenance or overload, for example) but totally useless otherwise.
Kind Regards, Dodoïste
Rodan Bury wrote:
Google's cache only lasts a handful of days. It's very useful if the website is down temporarily (for maintenance or overload, for example) but totally useless otherwise.
Even if Google Cache were less ephemeral, it's hard to avoid the soupçon of evil that permeates Google.
Ec
Does anyone know what the status of the Internet Archive is with respect to being a practical ongoing concern?
In the last couple years IA has added relatively little web-based content.
For example, their Wayback Machine currently offers:
www.nytimes.com: 11 pages since 2006
en.wikipedia.org: 5 pages since 2008
www.nasa.gov: 12 pages since 2008
scienceblogs.com: 0 pages since 2008
It gives the impression that they are so ineffective at archiving recent content as to be effectively irrelevant. They do have a warning that it can take 6 or more months for newly accessed content to be incorporated into their database, but at this point the delay has been significantly more than that. Even at their peak they rarely archived more than a few hundred pages per major domain per year, which still amounts to a tiny fraction of the internet.
The idea of seeking collaborations with people that archive web content is a good one, but I don't know that IA is really in a position to be all that useful.
-Robert Rohde
"It gives the impression that they are so ineffective at archiving recent content as to be effectively irrelevant."
You're not the only person asking that question; have a look at this FAQ entry (http://www.archive.org/about/faqs.php#103) and this forum post (http://www.archive.org/post/320741/large-site-with-no-entries-at-all-for-2008-2009-2010). To quote the FAQ specifically:
"It generally takes 6 months or more (up to 24 months) for pages to appear in the Wayback Machine after they are collected, because of delays in transferring material to long-term storage and indexing, or the requirements of our collection partners.
In some cases, crawled content from certain projects can appear in a much shorter timeframe - as little as a few weeks from when it was crawled. Older material for the same pages and sites may still appear separately, months later.
There is no access to files before they appear in the Wayback Machine."
"Even at their peak they rarely archived more than a few hundred pages per major domain per year, which still amounts to a tiny fraction of the internet."
Keep in mind that sub-pages are indexed separately. For example the Administrators' noticeboard (http://web.archive.org/web/*/http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard) and the blocking policy (http://web.archive.org/web/*/http://en.wikipedia.org/wiki/Wikipedia:Blocking_policy) are indexed at least several times a year. Equally, keep in mind that the reliable sources we use rarely change their content at a later date. A news article published in a newspaper is static, and most news articles posted online are equally static (with one or two updates before they move off the main page). As such we don't need a high update interval; a single backlink is often more than sufficient for referencing purposes, since we aren't keeping a revision history for sources.
~Excirial