Domas Mituzas wrote:
Magnus,
history. I could probably demonstrate by requesting the history of GWB with limit=100000 a few dozen times simultaneously, but I won't for obvious reasons.
Instead of writing an additional feature for that, you could just restrict the limit on page history :)
I'm not sure I follow the joke here. Yes, obviously we have to limit the limit (sorry ;-) on history. My feature costs much less in database access time and, especially, rendering time.
Could you do me a favor? Run a SELECT DISTINCT on all authors of GWB manually and time it. That should represent the worst-case scenario. I'm curious how long it takes.
The worst-case scenario is that the revision data would have to be loaded into memory, instead of resting in its safe haven on our slow I/O :) That would mean we'd need twice as much memory on all DB servers.
I don't follow this either, sorry. Until we have a concrete number on how long the GWB page (or a similar one) takes for that query, speculation is futile IMHO.
There are ways to limit this. AFAIK, it is not (strictly) necessary to list IPs in the author list, so one could add "WHERE rev_user > 0" to the query. That should take care of the anon vandals, which in the case of GWB should be quite a percentage.
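To make that concrete, the kind of query I have in mind would look roughly like this. It is just a sketch, assuming the 1.5 schema's revision table (rev_page, rev_user, rev_user_text); the page id is a placeholder:

  SELECT DISTINCT rev_user_text     -- one row per distinct author name
  FROM revision
  WHERE rev_page = @gwb_page_id     -- placeholder for the article's page_id
    AND rev_user > 0;               -- logged-in users only, i.e. skip anon/IP edits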
Not anymore. On the other hand, that'd require scanning some index anyway :)
There'd still be plenty of anons from before the protection. But if it isn't cheaper anyway, OK.
It might also be possible to put an absolute limit into the query. But that is a question for the legal guys.
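In terms of the sketch above, that would just mean appending a hard cap, e.g.:

  SELECT DISTINCT rev_user_text
  FROM revision
  WHERE rev_page = @gwb_page_id     -- placeholder page_id, as above
    AND rev_user > 0
  LIMIT 500;                        -- arbitrary cap; the right number is the legal question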
Would not help.
OK.
Brion, please note that this is not about my latest cute little script. AFAIK, there is currently *no* way to publish the GWB article as demanded by the GFDL, short of downloading the *entire* en.wikipedia including all old revisions, installing MediaWiki, importing the whole thing, and then running the query locally. Not exactly what I'd call user-friendly...
Maybe we should write author XML streams during the backup process, but we should not solve this issue by running it all on our main DB servers.
Well, not on our "main" (=master) servers; it doesn't require writes. Or do you mean the slave servers by "main"? If so, what other database servers besides "main" do we have? If there were a database/Apache server group dedicated to a (future) API in general, that would be great indeed!
And the author information would be outdated as soon as someone edits the article after the backup. Or do you mean inside the backup XML stream, for the "current version" dump? That would be an improvement, but no solution for average Joe who "wants these three pages as PDF".
Magnus