Brion Vibber wrote:
Magnus Manske wrote:
That
would be way too expensive, no can do.
"SELECT DISTINCT rev_user,rev_user_text FROM revision" for a
single page
is too expensive? Come on, article history is probably more expensive
than that!
Well, we've got this site called "Wikipedia", which has a lot of pages with
tens
of thousands of revisions on them. ;)
Whenever we have a feature that lets people go wild on all of them, it
bogs down the databases until we have to remove it. You need to take
these cases into account *first*.
If you had read my other mails (sent about an hour before yours),
you'd have seen that there is already such a thing, via the page
history. I could probably demonstrate by requesting the history of GWB
with limit=100000 a few dozen times simultaneously, but I won't, for
obvious reasons.
Could you do me a favor? Run a SELECT DISTINCT on all authors of GWB
manually and time it. That should represent the worst-case scenario.
I'm curious how long it takes.
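To spell out what I mean, the worst-case query would be something along
these lines (a sketch, assuming the standard MediaWiki schema, where
revision.rev_page holds the page ID and the GWB article's ID is looked
up from the page table; the mysql client prints the elapsed time with
each result, which is the number I'm after):

```sql
-- Distinct authors of one heavily edited page ([[George W. Bush]]).
-- The subquery looks up the page ID; namespace 0 = main namespace.
SELECT DISTINCT rev_user, rev_user_text
FROM revision
WHERE rev_page = (SELECT page_id
                  FROM page
                  WHERE page_namespace = 0
                    AND page_title = 'George_W._Bush');
```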
There are ways to limit this. AFAIK, it is not (strictly) necessary to
list IPs in the author list, so one could add "WHERE rev_user>0" to
the query. That should take care of the anonymous vandals, which in the
case of GWB should be quite a percentage.
It might also be possible to put an absolute limit into the query. But
that is a question for the legal guys.
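Putting both restrictions together, the query might look like this
(again a sketch; @page_id stands in for the looked-up page ID, and the
LIMIT value of 1000 is just a placeholder -- whether a hard cap is
GFDL-compatible at all is exactly the question for the legal guys):

```sql
-- Registered authors only: rev_user > 0 excludes anonymous (IP) edits,
-- whose revisions are stored with rev_user = 0.
-- LIMIT puts an absolute cap on the result size.
SELECT DISTINCT rev_user, rev_user_text
FROM revision
WHERE rev_page = @page_id
  AND rev_user > 0
LIMIT 1000;
```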
Brion, please note that this is not about my latest cute little script.
AFAIK, there is currently *no* way to publish the GWB article as
demanded by the GFDL, short of downloading the *entire* en.wikipedia
including all old revisions, installing MediaWiki, importing the whole
thing, and then running the query locally. Not exactly what I'd call
user-friendly...
Magnus