Brion Vibber wrote:
Magnus Manske wrote:
Brion, please refrain from quoting half-sentences of mine, which gives
the impression I said something I didn't.
There is no legal necessity for the "list authors" feature. I never said
that. I said that it is currently very inconvenient to fulfil the
demands of the GFDL, because you have to download the entire Wikipedia,
including history, to correctly publish a single article. What good is
it to have an open license if taking advantage of it is a PITA? (I'm
talking non-geeks here, and geeks who have a slow machine/internet
connection/etc.)
Okay, I quoted the entire paragraph to show that you said exactly what I thought
you said.
Now, I ask again the same thing for the same reasons: please ask Brad Patrick,
the foundation's legal counsel, about this to see if it's actually true.
I checked the GFDL myself, and it seems that only *modified* copies
require the author list. So, you're right if it's just a PDF of the wiki
(not modified). And, I'm right for things like the WikiReaders and
WikiPress, or all things that are not plain copies. I've added basic ODT
export to the XML parser (which has become surprisingly fast, BTW). When
producing an editable format, providing a list of authors thus seems
prudent to prevent ... misunderstandings of the license.
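The deduplication such an export needs is trivial, by the way. A quick
sketch (Python; the revision tuples are invented, and I'm assuming the
MediaWiki convention that anonymous edits carry user ID 0):

```python
def distinct_authors(revisions, ignore_anons=False):
    """Deduplicate (user_id, user_name) pairs from a page's revision
    list, preserving first-contribution order. Assumes anonymous
    edits are stored with user_id 0, as in the MediaWiki schema."""
    seen = set()
    authors = []
    for user_id, user_name in revisions:
        if ignore_anons and user_id == 0:
            continue
        if (user_id, user_name) not in seen:
            seen.add((user_id, user_name))
            authors.append(user_name)
    return authors

# Made-up revision data: two registered users, one anon, one repeat edit
revs = [(7, "Alice"), (0, "192.0.2.1"), (7, "Alice"), (12, "Bob")]
print(distinct_authors(revs, ignore_anons=True))  # → ['Alice', 'Bob']
```

The real work, of course, is not the dedup but getting the revision
rows out of the database cheaply.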
If you want to contribute something that is actually helpful, please
run the one little SQL query I asked you to run, and tell us how many
seconds it takes. Maybe run it again ignoring anons. If it takes 2
seconds, the feature could be activated (for the time being, of course);
if it takes 30 seconds, activating it on Wikipedia is certainly out of
the question.
For database performance issues I leave this in Domas's hands, since he'll just
turn it off if he thinks it's a problem anyway.
A quick test in isolation on lomaria showed 'George W. Bush' taking about a
second and 'Wikipedia:Sandbox' about 4 seconds. As frequently hit pages they may
be better in cache already; Sandbox results were much faster after that first
hit. Random access could lead to cache churn; check with Domas.
EXPLAIN shows use of a temporary table, not necessarily a good sign but not too
bad if it fits in memory:
mysql> EXPLAIN SELECT DISTINCT rev_user,rev_user_text FROM revision WHERE rev_page=3414021;
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| table    | type  | possible_keys          | key     | key_len | ref  | rows  | Extra                        |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| revision | range | PRIMARY,page_timestamp | PRIMARY | 4       | NULL | 42114 | Using where; Using temporary |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
1 row in set (0.02 sec)
Thanks for running this. It seems, as Rob already said, it might be best
to plan for a few servers for a future API. I'll bother foundation-l
with this. Also, I volunteer to write/help with API development ;-)
OTOH, 4 seconds of MySQL and little Apache load - compared to the 32
seconds it just took srv28 to render the actual GWB page - is not that
much...
More worrying than the time it takes is the amount of data it churns out: 8965
rows for George W. Bush, 21528 rows for Wikipedia:Sandbox. That's only going to
get longer as time goes on, and it's unsustainable in the long term. (That's
possibly why the GFDL explicitly *doesn't* require a list of every contributor.)
It's still a lot with accounts only: 2664 rows for GWB and 6569 for the sandbox.
I expect the growth of the distinct author list for GWB to be slower,
with the protection in place. There is a lot of editing going on, but
many edits seem to be made by a small group. Also, 2664 rows would
result in a roughly estimated 160KB of XML, which can be cut down
significantly by using <c> instead of <contributor>, omitting user IDs,
etc. Maybe 70KB total, transferred as 35KB gzipped (all just rough
estimates, but IMHO correct within an order of magnitude). We have
article texts that long.
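That 70KB guess is easy to sanity-check with a quick sketch (Python; the
<c u="..."/> element and its attribute are made up for illustration, not
the actual export format):

```python
import gzip

def estimate_sizes(n_contributors):
    """Rough raw and gzipped size of a compact per-contributor XML
    list. Uses made-up 12-character user names; real names vary."""
    entries = "".join(
        '<c u="User%08d" />\n' % i for i in range(n_contributors)
    )
    raw = entries.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

# 2664 = distinct registered contributors counted for GWB above
raw, packed = estimate_sizes(2664)
print("%d KB raw, %d KB gzipped" % (raw // 1024, packed // 1024))
```

Synthetic sequential names compress far better than real ones, so only
the raw figure means much here; it lands in the same order of magnitude
as the estimate above, which is all that's claimed.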
So even if it's fast enough for the moment, I'd much prefer if we had something
that fit clear requirements. If the idea is for every random person grabbing
pages off our site to have the minimal GFDL requirements, I'm not so sure this
fits the bill.
I now agree that this is less critical than I previously thought, but I
still think it will be very useful, increasingly so as more and more
Wikipedia "spin-off" products and services mushroom. Yes, it is not
"mission-critical", but neither are the whole Special:Export page
itself, the RSS feeds, or the random page function (which has its own
database field *and* index, for crying out loud!).
A few dedicated API servers are probably the way to go in the long run.
Magnus