Magnus Manske wrote:
> Brion, please refrain from quoting half-sentences of mine, which gives the impression I said something I didn't.
> There is no legal necessity for the "list authors" feature. I never said that. I said that it is currently very inconvenient to fulfil the demands of the GFDL, because you have to download the entire Wikipedia, including history, to correctly publish a single article. What good is it to have an open license if taking advantage of it is a PITA? (I'm talking about non-geeks here, and geeks with a slow machine/internet connection/etc.)
Okay, I quoted the entire paragraph to show that you said exactly what I thought you said.
Now I'll ask the same thing again, for the same reasons: please ask Brad Patrick, the Foundation's legal counsel, about this to see whether it's actually true.
> If you want to contribute something that is actually helpful, please run the one little SQL query I asked you to run, and tell us how many seconds it takes. Maybe run it again ignoring anons. If it takes 2 seconds, the feature could be activated (for the time being, of course); if it takes 30 seconds, activating it on Wikipedia is certainly out of the question.
For database performance issues I leave this in Domas's hands, since he'll just turn it off if he thinks it's a problem anyway.
A quick test in isolation on lomaria showed 'George W. Bush' taking about a second and 'Wikipedia:Sandbox' about 4 seconds. As frequently hit pages they may well already be in cache; the Sandbox results were much faster after that first hit. Random access could lead to cache churn; check with Domas.
EXPLAIN shows use of a temporary table, not necessarily a good sign but not too bad if it fits in memory:
mysql> EXPLAIN SELECT DISTINCT rev_user,rev_user_text FROM revision WHERE rev_page=3414021;
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| table    | type  | possible_keys          | key     | key_len | ref  | rows  | Extra                        |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| revision | range | PRIMARY,page_timestamp | PRIMARY | 4       | NULL | 42114 | Using where; Using temporary |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
1 row in set (0.02 sec)
More worrying than the time it takes is the amount of data it churns out: 8965 rows for George W. Bush, 21528 rows for Wikipedia:Sandbox. Those lists are only going to grow over time, which is unsustainable in the long term. (That's possibly why the GFDL explicitly *doesn't* require a list of every contributor.)
It's still a lot with accounts only: 2664 rows for GWB and 6569 for the sandbox.
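For anyone following along, here's a sketch of the two queries using SQLite as a stand-in. The column names match the revision table above; the rows are invented for illustration, and the accounts-only variant assumes MediaWiki's convention of rev_user = 0 for anonymous edits.

```python
import sqlite3

# Toy stand-in for the revision table; column names match the schema
# discussed above, but the data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_page INT, rev_user INT, rev_user_text TEXT)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [
        (3414021, 12, "Alice"),
        (3414021, 12, "Alice"),      # repeat edits collapse under DISTINCT
        (3414021, 0, "192.0.2.7"),   # rev_user = 0 marks an anonymous edit
        (3414021, 34, "Bob"),
    ],
)

# The query under test: every distinct contributor to one page.
all_contribs = conn.execute(
    "SELECT DISTINCT rev_user, rev_user_text FROM revision "
    "WHERE rev_page = ?",
    (3414021,),
).fetchall()

# The "ignoring anons" variant behind the accounts-only numbers.
accounts_only = conn.execute(
    "SELECT DISTINCT rev_user, rev_user_text FROM revision "
    "WHERE rev_page = ? AND rev_user != 0",
    (3414021,),
).fetchall()

print(len(all_contribs))    # 3
print(len(accounts_only))   # 2
```

The result set scales with the number of distinct contributors, which is exactly the growth problem above.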
So even if it's fast enough for the moment, I'd much prefer something built against clear requirements. If the idea is to let every random person grabbing pages off our site meet the minimal GFDL requirements, I'm not so sure this fits the bill.
-- brion vibber (brion @ pobox.com)