Brion Vibber wrote:
Magnus Manske wrote:
Brion, please refrain from quoting half-sentences of mine, which gives
the impression I said something I didn't.
There is no legal necessity for the "list authors" feature. I never said
that. I said that it is currently very inconvenient to fulfil the
demands of the GFDL, because you have to download the entire Wikipedia,
including history, to correctly publish a single article. What good is
it to have an open license if taking advantage of it is a PITA? (I'm
talking non-geeks here, and geeks who have a slow machine/internet
connection/etc.)
Okay, I quoted the entire paragraph to show that you said exactly what I thought
you said.
Now, I ask again the same thing for the same reasons: please ask Brad Patrick,
the foundation's legal counsel, about this to see if it's actually true.
I checked the GFDL myself, and it seems that only *modified* copies
require the author list. So, you're right if it's just a PDF of the wiki
(not modified). And, I'm right for things like the WikiReaders and
WikiPress, or all things that are not plain copies. I've added basic ODT
export to the XML parser (which has become surprisingly fast, BTW). When
producing an editable format, providing a list of authors thus seems
prudent to prevent ... misunderstandings of the license.
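The deduplication such an export needs is trivial, by the way. A quick
sketch (Python; the revision tuples are invented, and I'm assuming the
MediaWiki convention that anonymous edits carry user ID 0):

```python
def distinct_authors(revisions, ignore_anons=False):
    """Deduplicate (user_id, user_name) pairs from a page's revision
    list, preserving first-contribution order. Assumes anonymous
    edits are stored with user_id 0, as in the MediaWiki schema."""
    seen = set()
    authors = []
    for user_id, user_name in revisions:
        if ignore_anons and user_id == 0:
            continue
        if (user_id, user_name) not in seen:
            seen.add((user_id, user_name))
            authors.append(user_name)
    return authors

# Made-up revision data: two registered users, one anon, one repeat edit
revs = [(7, "Alice"), (0, "192.0.2.1"), (7, "Alice"), (12, "Bob")]
print(distinct_authors(revs, ignore_anons=True))  # → ['Alice', 'Bob']
```

The real work, of course, is not the dedup but getting the revision
rows out of the database cheaply.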
If you want to contribute something that is actually helpful, please
run the one little SQL query I asked you to run, and tell us how many
seconds it takes. Maybe run it again ignoring anons. If it takes 2
seconds, the feature could be activated (for the time being, of course);
if it takes 30 seconds, activating it on Wikipedia is certainly out of
the question.
For database performance issues I leave this in Domas's hands, since he'll just
turn it off if he thinks it's a problem anyway.
A quick test in isolation on lomaria showed 'George W. Bush' taking about a
second and 'Wikipedia:Sandbox' about 4 seconds. As frequently hit pages they may
be better in cache already; Sandbox results were much faster after that first
hit. Random access could lead to cache churn; check with Domas.
EXPLAIN shows use of a temporary table, not necessarily a good sign but not too
bad if it fits in memory:
mysql> EXPLAIN SELECT DISTINCT rev_user,rev_user_text FROM revision WHERE rev_page=3414021;
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| table    | type  | possible_keys          | key     | key_len | ref  | rows  | Extra                        |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
| revision | range | PRIMARY,page_timestamp | PRIMARY | 4       | NULL | 42114 | Using where; Using temporary |
+----------+-------+------------------------+---------+---------+------+-------+------------------------------+
1 row in set (0.02 sec)
Thanks for running this. It seems, as Rob already said, it might be best
to plan for a few servers for a future API. I'll bother foundation-l
with this. Also, I volunteer to write/help with API development ;-)
OTOH, 4 seconds of MySQL and little Apache load - compared to the 32
seconds it just took srv28 to render the actual GWB page - is not that
much...
More worrying than the time it takes is the amount of data it churns out: 8965
rows for George W. Bush, 21528 rows for Wikipedia:Sandbox. That's only going to
get longer as time goes on, and it's unsustainable in the long term. (That's
possibly why the GFDL explicitly *doesn't* require a list of every contributor.)
It's still a lot with accounts only: 2664 rows for GWB and 6569 for the sandbox.
I expect the growth of the distinct author list for GWB to be slower,
with the protection in place. There is a lot of editing going on, but
many edits seem to be made by a small group. Also, 2664 rows would
result in a roughly estimated 160KB of XML, which can be cut down
significantly by using <c> instead of <contributor>, omitting user IDs,
etc. Maybe 70KB total, transferred as 35KB gzipped (all just rough
estimates, but IMHO correct within an order of magnitude). We have
article texts that long.
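That 70KB guess is easy to sanity-check with a quick sketch (Python; the
<c u="..."/> element and its attribute are made up for illustration, not
the actual export format):

```python
import gzip

def estimate_sizes(n_contributors):
    """Rough raw and gzipped size of a compact per-contributor XML
    list. Uses made-up 12-character user names; real names vary."""
    entries = "".join(
        '<c u="User%08d" />\n' % i for i in range(n_contributors)
    )
    raw = entries.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

# 2664 = distinct registered contributors counted for GWB above
raw, packed = estimate_sizes(2664)
print("%d KB raw, %d KB gzipped" % (raw // 1024, packed // 1024))
```

Synthetic sequential names compress far better than real ones, so only
the raw figure means much here; it lands in the same order of magnitude
as the estimate above, which is all that's claimed.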
So even if it's fast enough for the moment, I'd much prefer if we had something
that fit clear requirements. If the idea is for every random person grabbing
pages off our site to have the minimal GFDL requirements, I'm not so sure this
fits the bill.
I now agree that this is less critical than I previously thought, but I
still think it will be very useful, increasingly so as more and more
Wikipedia "spin-off" products and services mushroom. Yes, it is not
"mission-critical", but neither are the whole Special:Export page
itself, the RSS feeds, or the random page function (which has its own
database field *and* index, for crying out loud!).
A few dedicated API servers are probably the way to go in the long run.
Magnus