[Foundation-l] A proposal of partnership between Wikimedia Foundation and Internet Archive

Excirial wp.excirial at gmail.com
Sat Aug 28 22:17:46 UTC 2010


*What would it take to produce such a feed?*
A real-time feed may not be the best idea, for several reasons.
- Every edit would have to be examined not only for external links, but
for external links that were not present in the previous revision (see
the sketch after this list). Doing this in real time could cause
slowdowns or additional load on the servers - keep in mind that we
would have to scan the external links on every edit across all
Wikipedias; counted together this would result in a very, very busy
feed towards IA.
- Sometimes added links are spam or otherwise unacceptable, which means
they may be removed soon after. In such a case one would prefer not
having them archived, since it would be a waste of time and work for
IA.
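
To give an idea of what that per-edit check involves, here is a minimal
Python sketch. It assumes the public MediaWiki API; the crude URL regex
and the function names are mine, purely for illustration:

import re
import requests

API = "https://en.wikipedia.org/w/api.php"
# Crude pattern for illustration only; real link extraction would use
# the parser's own notion of an external link, not raw wikitext.
URL_RE = re.compile(r"https?://[^\s\]|<>\"]+")

def revision_text(revid):
    """Fetch the raw wikitext of a single revision via the API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": revid,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["slots"]["main"]["*"]

def new_links(parent_revid, revid):
    """External links present in an edit but not in its parent."""
    before = set(URL_RE.findall(revision_text(parent_revid)))
    after = set(URL_RE.findall(revision_text(revid)))
    return after - before

Multiply those two content fetches by every edit on every project and
the load concern above becomes obvious.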

An alternate solution could be forwarding a list of new links every
day. The database layout
<http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema.png>
for Wikimedia seems to suggest that all external links are stored in a
separate table in the database (and I presume this includes links in
reference tags). I wonder if it would be possible to dump this entire
table for IA, and afterwards send them incremental change packages
<http://en.wikipedia.org/wiki/Changeset> once a day or so. That way
they would always have an up-to-date list of the external links used by
Wikipedia, and it would reduce the problems with performance hits, spam
and links no longer used. If we only forwarded a feed of NEW links, IA
might end up with a long list of links that are removed over time. And
above everything: the external links table is simply a database table,
which should be incredibly easy for IA to read and process, with no
custom coding required to read and store a feed.
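
The incremental part would then be almost trivial. A minimal sketch,
assuming a plain one-URL-per-line dump of that table each day (the file
names below are made up):

def load_urls(path):
    """Read one dump of the external links table, one URL per line."""
    with open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())

yesterday = load_urls("externallinks-2010-08-27.txt")  # hypothetical
today = load_urls("externallinks-2010-08-28.txt")      # file names

added = today - yesterday    # new links for IA to crawl
removed = yesterday - today  # links IA can deprioritize or retire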

But perhaps the people on the tech mailing list have another, better
idea of how this should work :)

~Excirial



On Sat, Aug 28, 2010 at 9:48 AM, Samuel Klein <meta.sj at gmail.com> wrote:

> Gordon @ IA was most friendly and helpful.  archive-it is a
> subscription service for focused collections of sites; he had a
> different idea better suited to our work.
>
> Gordon writes:
> > Now, given the importance of Wikipedia and the editorial
> > significance of the things it outlinks to, perhaps we could set up
> > something specially focused on its content (and the de facto stream
> > of newly-occurring outlinks), that would require no conscious effort
> > by editors but greatly increase the odds that anything linked from
> > Wikipedia would (a few months down the line) also be in our Archive.
> > Is there (or could there be) a feed of all outlinks that IA could
> > crawl almost nonstop?
>
> That sounds excellent to me, if possible (and I think close to what
> emijrp had in mind!)  What would it take to produce such a feed?
>
> SJ
>
> PS - An aside: IA's policies include taking down any links on request,
> so this would not be a foolproof archive, but a 99% one.
>
>
> On Tue, Aug 24, 2010 at 9:13 PM, Samuel Klein <meta.sj at gmail.com> wrote:
> > I've asked Gordon Mohr @ IA about how to work with archive-it.  I will
> > cc: this thread on any response.
> >
> > SJ
> >
> > On Tue, Aug 24, 2010 at 8:56 PM, George Herbert
> > <george.herbert at gmail.com> wrote:
> >>> On Tue, Aug 24, 2010 at 5:48 PM, Samuel Klein <meta.sj at gmail.com> wrote:
> >>> Here's the Archive's on-demand service:
> >>>
> >>> http://archive-it.org
> >>>
> >>> That would be the most reliable way to set up the partnership emijrp
> >>> proposes.  And it's certainly a good idea.  Figuring out how to make
> >>> it work for almost all editors and make it spam-proof may be
> >>> interesting.
> >>>
> >>> SJ
> >>>
> >>>
> >>>
> >>> On Tue, Aug 24, 2010 at 8:45 PM, Ray Saintonge <saintonge at telus.net> wrote:
> >>>> David Gerard wrote:
> >>>>> On 24 August 2010 14:57, emijrp <emijrp at gmail.com> wrote:
> >>>>>
> >>>>>> I want to make a proposal about external links preservation.
> >>>>>> Many times, when you check an external link or a link
> >>>>>> reference, the website is dead or offline. These websites are
> >>>>>> important, because they are the sources for the facts shown in
> >>>>>> the articles. Internet Archive searches for interesting
> >>>>>> websites to save on their hard disks, so we can send them our
> >>>>>> external links SQL tables (all projects and languages, of
> >>>>>> course). They improve their database and we always have a copy
> >>>>>> of the source text to check when needed. I think that this can
> >>>>>> be a cool partnership.
> >>>>>>
> >>>>> +1
> >>>>>
> >>>>>
> >>>> Are people who clean up dead links taking the time to check
> >>>> Internet Archive to see if the page in question is there?
> >>>>
> >>>>
> >>>> Ec
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Samuel Klein          identi.ca:sj           w:user:sj
> >>>
> >>
> >>
> >> I actually proposed some form of Wikimedia / IArchive link
> >> collaboration some years ago to a friend who worked there at the time;
> >> however, they left shortly afterwards.
> >>
> >> I like SJ's particular idea.  Who has current contacts with Brewster
> >> Kahle or someone else over there?
> >>
> >>
> >> --
> >> -george william herbert
> >> george.herbert at gmail.com
> >>
> >>
> >
> >
> >
> > --
> > Samuel Klein          identi.ca:sj           w:user:sj
> >
>
>
>
> --
> Samuel Klein          identi.ca:sj           w:user:sj
>
> _______________________________________________
> foundation-l mailing list
> foundation-l at lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>

