Brion Vibber wrote:
Magnus Manske wrote:
That would be way too expensive, no can do.
"SELECT DISTINCT rev_user,rev_user_text FROM revision" for a single page is too expensive? Come on, article history is probably more expensive than that!
Well, we've got this site called "Wikipedia", which has a lot of pages with tens of thousands of revisions on them. ;)
Whenever we have a feature that allows going wild on all of them and we let people use it, it bogs down the databases until we have to remove it. You need to take these cases into account *first*.
If you had read my other mails (sent about an hour before yours), you'd have seen that there is already such a thing, via the page history. I could probably demonstrate by requesting the history of GWB with limit=100000 a few dozen times simultaneously, but I won't, for obvious reasons.
Could you do me a favor? Run a SELECT DISTINCT on all authors of GWB manually and time it. That should represent the worst-case scenario. I'm curious how long it takes.
There are ways to limit this. AFAIK, it is not (strictly) necessary to list IPs in the author list, so one could add "WHERE rev_user>0" to the query. That should take care of the anon vandals, which in the case of GWB should account for quite a percentage of the edits.
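A minimal sketch of the query being discussed, using an in-memory SQLite stand-in for MediaWiki's revision table (the column names follow the real schema, but the toy data and the per-page rev_page filter are my additions for illustration):

```python
import sqlite3

# Toy stand-in for MediaWiki's revision table: rev_user is 0 for
# anonymous edits, and rev_user_text holds the username or the IP.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE revision (
    rev_page INTEGER,
    rev_user INTEGER,
    rev_user_text TEXT
)""")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [
        (1, 42, "Alice"),     # registered user
        (1, 42, "Alice"),     # second edit by the same user
        (1, 0, "192.0.2.7"),  # anonymous edit (rev_user = 0)
        (1, 7,  "Bob"),
    ],
)

# Distinct authors for one page, skipping anons as proposed above.
authors = conn.execute(
    "SELECT DISTINCT rev_user, rev_user_text FROM revision "
    "WHERE rev_page = ? AND rev_user > 0",
    (1,),
).fetchall()
print(sorted(authors))  # [(7, 'Bob'), (42, 'Alice')]
```

The DISTINCT collapses Alice's repeated edits into one row, and the rev_user > 0 filter drops the anonymous IP entirely.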
It might also be possible to put an absolute limit into the query. But that is a question for the legal guys.
Brion, please note that this is not about my latest cute little script. AFAIK, there is currently *no* way to publish the GWB article as demanded by the GFDL short of downloading the *entire* en.wikipedia including all old revisions, installing MediaWiki, importing the whole thing, and then running the query locally. Not exactly what I'd call user-friendly...
Magnus