Magnus,
history. I could probably demonstrate by requesting the history of GWB with limit=100000 a few dozen times simultaneously, but I won't for obvious reasons.
Instead of writing an additional feature for that, you could just restrict the limit on page history :)
Could you do me a favor? Run a SELECT DISTINCT on all authors of GWB manually and time it. This should represent the worst-case scenario. I'm curious how long that takes.
The worst-case scenario is that the revision data would have to be loaded into memory, instead of staying in its safe haven on our slow I/O :) That would mean we'd need twice as much memory on all DB servers.
There are ways to limit this. AFAIK, it is not (strictly) necessary to list IPs in the author list, so one could add "WHERE rev_user_id>0" to the query. That should take care of the anon vandals, who in the case of GWB should account for quite a percentage.
Not anymore. On the other hand, that'd require scanning some index anyway :)
It might also be possible to put an absolute limit into the query. But that is a question for the legal guys.
Would not help.
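
For the record, the kind of query being discussed would look roughly like this (a sketch only; <GWB_page_id> is a placeholder for the page id of [[George W. Bush]], the column names follow Magnus's mail, and the cap of 5000 is an arbitrary number):

    -- distinct registered authors of one page, as Magnus describes
    SELECT DISTINCT rev_user_text
    FROM revision
    WHERE rev_page = <GWB_page_id>
      AND rev_user_id > 0   -- skip anonymous edits, as suggested above
    LIMIT 5000;             -- the "absolute limit" idea, if legal allows it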
Brion, please note that this is not about my latest cute little script. AFAIK, there is currently *no* way to publish the GWB article as demanded by the GFDL short of downloading the *entire* en.wikipedia including all old revisions, installing MediaWiki, importing the whole thing and then running the query locally. Not exactly what I'd call user-friendly...
Maybe we should write author XML streams during the backup process, but we should not solve this issue by running it all on our main DB servers.
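
Purely as an illustration of what I mean (nothing like this exists yet, and the element names are made up), such a per-page author stream might look like:

    <page>
      <title>George W. Bush</title>
      <contributors>
        <username>ExampleUser1</username>
        <username>ExampleUser2</username>
      </contributors>
    </page>

That way people could get the GFDL author list from the dumps without touching the live DB servers at all.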
Domas