Brion Vibber wrote:
Magnus Manske wrote:
Brion, please refrain from quoting half-sentences of mine, which gives the impression I said something I didn't.
There is no legal necessity for the "list authors" feature, and I never said there was. I said that it is currently very inconvenient to fulfil the demands of the GFDL, because you have to download the entire Wikipedia, including history, to correctly publish a single article. What good is it to have an open license if taking advantage of it is a PITA? (I'm talking non-geeks here, and geeks who have a slow machine/internet connection/etc.)
Okay, I quoted the entire paragraph to show that you said exactly what I thought you said.
Now I ask the same thing again, for the same reasons: please ask Brad Patrick, the foundation's legal counsel, about this to see whether it's actually true.
I checked the GFDL myself, and it seems that only *modified* copies require the author list. So you're right if it's just a PDF of the wiki (not modified), and I'm right for things like the WikiReaders and WikiPress, or anything else that is not a plain copy. I've added basic ODT export to the XML parser (which has become surprisingly fast, BTW). When producing an editable format, providing a list of authors thus seems prudent to prevent ... misunderstandings of the license.
If you want to contribute something that is actually helpful, please run the one little SQL query I asked you to run, and tell us how many seconds it takes. Maybe run it again ignoring anons. If it takes 2 seconds, the feature could be activated (for the time being, of course); if it takes 30 seconds, activating it on Wikipedia is certainly out of the question.
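For reference, what I have in mind is something along these lines (just a sketch; the page ID is only an example, and I'm assuming anonymous edits are stored with rev_user = 0):

-- distinct authors of one page
SELECT DISTINCT rev_user, rev_user_text
FROM revision
WHERE rev_page = 3414021;

-- the same, ignoring anons
SELECT DISTINCT rev_user, rev_user_text
FROM revision
WHERE rev_page = 3414021
AND rev_user != 0;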
As for database performance, I leave this in Domas's hands, since he'll just turn it off if he thinks it's a problem anyway.
A quick test in isolation on lomaria showed 'George W. Bush' taking about a second and 'Wikipedia:Sandbox' about 4 seconds. As frequently hit pages they may already be better represented in cache; the Sandbox results were much faster after that first hit. Random access could lead to cache churn; check with Domas.
EXPLAIN shows use of a temporary table, which is not necessarily a good sign but not too bad if it fits in memory:
mysql> EXPLAIN SELECT DISTINCT rev_user,rev_user_text FROM revision WHERE rev_page=3414021;
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| table    | type  | possible_keys          | key     | key_len | ref  | rows  | Extra                        |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| revision | range | PRIMARY,page_timestamp | PRIMARY |       4 | NULL | 42114 | Using where; Using temporary |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
1 row in set (0.02 sec)
Thanks for running this. It seems, as Rob already said, that it might be best to plan for a few servers for a future API. I'll bother foundation-l with this. Also, I volunteer to write/help with API development ;-)
OTOH, 4 seconds of MySQL time and little Apache load - compared to the 32 seconds it just took srv28 to render the actual GWB page - is not that much...
More worrying than the time it takes is the amount of data it churns out: 8965 rows for George W. Bush, 21528 rows for Wikipedia:Sandbox. Those lists are only going to get longer as time goes on, which is unsustainable in the long term. (That's possibly why the GFDL explicitly *doesn't* require a list of every contributor.)
It's still a lot with accounts only: 2664 rows for GWB and 6569 for the sandbox.
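For the record, the counts can be reproduced with a COUNT(DISTINCT ...) wrapper around the same query, roughly like this (same example page ID as in the EXPLAIN above; anons have rev_user = 0):

SELECT COUNT(DISTINCT rev_user, rev_user_text)
FROM revision WHERE rev_page = 3414021;

-- registered accounts only
SELECT COUNT(DISTINCT rev_user, rev_user_text)
FROM revision WHERE rev_page = 3414021 AND rev_user != 0;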
I expect the growth of the distinct author list for GWB to be slower, with the protection in place. There is a lot of editing going on, but many edits seem to be made by a small group. Also, 2664 rows would result in a roughly estimated 160KB of XML, which can be cut down significantly by using <c> instead of <contributor>, omitting the user ID, etc. Maybe 70KB total, transferred as 35KB gzipped (all just rough estimations, but IMHO correct within the order of magnitude). We have article texts that long.
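To make the size estimate a bit more concrete, the two formats could look roughly like this (just a sketch; <c> is the tag I mentioned, and the name and ID are made up):

<!-- full form, on the order of 60 bytes per contributor -->
<contributor><username>ExampleUser</username><id>12345</id></contributor>
<!-- compact form, omitting the ID, roughly 25-30 bytes per contributor -->
<c>ExampleUser</c>

At around 26 bytes each, 2664 contributors come to roughly 70KB, and gzip should cut that in half again.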
So even if it's fast enough for the moment, I'd much prefer that we had something that fits clear requirements. If the idea is for every random person grabbing pages off our site to get the minimal GFDL requirements satisfied, I'm not so sure this fits the bill.
I now agree that this is less critical than I previously thought, but I still think it will be very useful, increasingly so as more and more Wikipedia "spin-off" products and services mushroom. Yes, it is not "mission-critical", but neither is the whole Special:Export page itself, the RSS feeds, or the random page function (which has its own database field *and* index, for crying out loud!).
A few dedicated API servers are probably the way to go in the long run.
Magnus