[Foundation-l] Baidupedia copyvio collections

Thu Jun 12 20:07:46 UTC 2008

2008/6/12 Titan Deng <theodoranian at gmail.com>:

>> Mmm... this may work. Finding five main authors is so tricky that we
>> usually recommend a link to the wp history page, though - and a link
>> to a blocked site is pretty useless in terms of actually giving
>> attribution!
>
> I think legally speaking it's not our responsibility to find ways for them
> to give attribution to the authors.  It is not reasonable they use those
> articles and at the same time they need us to provide legal ways to them.
> Just too over.

I think if we want them to provide attribution, it is a good thing for
us to try and make that process as easy and efficient for them as
possible. With normal mirrors - ones that don't operate behind the
Great Firewall - we have a pretty good record of getting attribution
sorted out, because we can email them and say very clearly and simply
what they need to do - and because it's painless, they can do it
without it costing them anything.

If we demanded those mirrors do a lot of work, on the other hand, we'd
get a much lower success rate.

The four obvious options for giving attribution:

a) Do what everyone else does, and link to the Wikipedia article
histories. Except that's meaningless for most of the readers - the
vast majority of them live in mainland China and won't be able to
follow the link, and the GFDL probably frowns a bit on a list of
authors which you aren't allowed to see...

b) Say "is taken from Wikipedia", or "copyright Wikipedia", without
the link, but this is in violation of the GFDL, just in a different
way.

c) Import full Wikipedia histories - thus giving attribution. However,
this brings in a vast amount of new material that needs vetting, and
so demands a lot more editorial oversight from Baidu. Probably
expensive.

d) Figure out main authors for each and every article, and attribute
them (without links?) in the Baidu articles. Aha, problem solved.

This last one is obviously the best option, but how would they get
those main authors? Working them all out by hand is incredibly
time-consuming when you have an even moderately long article, so for
it to be practical we need some way of generating them en masse.

It's a thorny problem even for us, and you'd expect us to be the
experts - we've tried before and never really found a method that's
reliable. If we want Baidu to do something like this, we'd stand a
much better chance if we can find some way of generating those authors
for them in advance, or identify an easy method they can use to do
so.*

If we just say "well, they can sort it out themselves, but they ought
to do something", it strikes me that we're going to just make it less
likely the problem ever gets fixed.

-- 
- Andrew Gray
 andrew.gray at dunelm.org.uk

* You know what would be really cool? Some kind of API that takes a
pagename or revision ID, crunches the article history, and spits back
five major authors.
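
As a very rough sketch of what the "crunching" step might look like,
assuming the article history has already been fetched as a list of
(author, full text) pairs, oldest first. The function name
major_authors and the scoring rule (credit each author with the net
characters they added, and give no credit for a revision that simply
restores an earlier text) are assumptions made for illustration, not
an established method; as said above, nobody has found a reliable one
yet.

```python
from collections import Counter


def major_authors(revisions, n=5):
    """Return up to n authors ranked by net characters added.

    revisions: iterable of (author, full_text) pairs, oldest first.
    Hypothetical heuristic: it ignores who wrote which surviving words
    and only looks at how much each edit grew the article.
    """
    credit = Counter()
    seen_texts = set()      # texts that already existed (to spot reverts)
    previous_length = 0

    for author, text in revisions:
        if text not in seen_texts:
            added = len(text) - previous_length
            if added > 0:   # removals and blanking earn no credit
                credit[author] += added
            seen_texts.add(text)
        # A revert restores an old text; track its length but give no credit.
        previous_length = len(text)

    return [author for author, _ in credit.most_common(n)]


history = [
    ("Alice", "a" * 100),
    ("Bob", "a" * 100 + "b" * 40),
    ("Vandal", ""),                                 # page blanked
    ("Carol", "a" * 100 + "b" * 40),                # revert of the blanking
    ("Dave", "a" * 100 + "b" * 40 + "d" * 10),
]
print(major_authors(history))  # ['Alice', 'Bob', 'Dave']
```

Counting characters is crude: someone who rewrites a paragraph at the
same length gets nothing, and someone who pastes in a large block that
is later cut gets a lot. It does at least show why reverts have to be
handled specially, since otherwise Carol would be credited with the
whole article.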