Hello.
Snipped many parts out, only replying to the last 2 paragraphs.
I hate committees, and I think it's a bad idea. However, I want to open discussion on how to address this problem, because I definitely think if it hasn't already, it deserves to be elevated to "problem" status.
I don't have anything against articles in many different flavors & formats. OK, when I write articles I try to write them in a certain way, so they have coherence. But apart from that, people should do what they want.
I'm not sure we want to standardize everything, because it means people have to learn that format, and know where to find it in the first place. You have to enforce it. It makes for wars as new users come in, or people don't like the format and say "hey, this format should be changed like that", and people reply "per the decision of two weeks earlier, no, we keep that", and so on and so on.
- Fancruft and how to cope
Sure. When the wiki is slow, we should just cut off en:. After all, that's fancruft, and it shouldn't exist. Or we should just concatenate all pages into one big page; then there wouldn't need to be any page-existence checking.
More seriously: saying that merging articles could (help) solve slowness is a bad social solution to a technical problem. People want to write articles, as many as they want. What would be next? "Sorry, you made more than 5 modifications in the last 15 minutes, please wait 15 minutes so that everyone gets a chance to edit"?
Site is slow? Make the software faster, buy new hardware, optimize, imagine a Beowulf cluster of servers on the moon, whatever - do *not* try to restrict the freedom people have.
Sorry if my tone sounds rash. I'm totally against your idea, maybe because I do write "fancruft" articles, but also because it's IMO totally against the founding ideas of Wikipedia - this doesn't mean I despise you :)
Nicolas Ryo
Nicolas Weeger wrote:
- Fancruft and how to cope
Sure. When the wiki is slow, we should just cut off en:. After all, that's fancruft, and it shouldn't exist. Or we should just concatenate all pages into one big page; then there wouldn't need to be any page-existence checking. More seriously: saying that merging articles could (help) solve slowness is a bad social solution to a technical problem. People want to write articles, as many as they want. What would be next? "Sorry, you made more than 5 modifications in the last 15 minutes, please wait 15 minutes so that everyone gets a chance to edit"?
Indeed. keats appears to be starting from a personal distaste and then claiming this will be the destruction of Wikipedia. I note a curious lack of substantiating, ahh, numbers. keats, do you have any?
Fundamentally, taking out the word "fancruft": keats appears to be claiming that the mere fact of having 500k articles is unsustainable in MediaWiki. Is this the case? Do we stop all article creation now? If not, what do we do? Ration them?
- d.
On Thu, 17 Mar 2005 10:04:00 +0000, David Gerard dgerard@gmail.com wrote:
Indeed. keats appears to be starting from a personal distaste and then claiming this will be the destruction of Wikipedia. I note a curious lack of substantiating, ahh, numbers. keats, do you have any?
First off, you can call me "Alex". That is my name. If you want to argue on IRC call me keats or whatever you choose to call me. I am not John Keats, nor am I a character in a book. A nick is chosen on IRC to be distinct from others. It could be waerth, jwales, TimStarling, or whatever. But, we are discussing this on a public mailing list, where my name is quite apparent, and I have signed my emails as such. Let's be adults about this.
Second, I do not have any numbers. I said that I felt that we had some problems. I proposed some solutions, and gave my analysis of said problems. The first step in solving a problem is to identify the problem. Then you go and find what might be possible solutions (more disk on the master DB, Squids on FreeBSD, Postgres on the backend, multiply redundant masters, colos on different continents, a BigIP or two), and you /test them/. The fact that I haven't gone out and built my own Wikipedia cluster and tested every solution I offered is hardly a fair criticism. I have at my disposal two PowerBooks, two iBooks, and a single 2x 866 MHz Linux firewall. I don't have the resources to do all this testing. Nobody just cut me a check for $86,000, either. I am offering my time and my expertise, and even some money. But the testing has to be done by the Foundation.
Fundamentally, taking out the word "fancruft": keats appears to be claiming that the mere fact of having 500k articles is unsustainable in MediaWiki. Is this the case? Do we stop all article creation now? If not, what do we do? Ration them?
Do you not see that we are having weekly outages? So we hit 500k articles and lo, it holds together. Where do we go from here? With our one master database server. Somebody goes out and drops $100k on an 8-way Opteron 848. Somebody drops a further $100k on disk. Problem solved until we hit 100M articles. Right, because every single Power Ranger should have their own page, and every single villain, and every single Care Bear, and every character from Charmed, and every dicdef that never makes it into the Wiktionary, and so on and so forth.
You can't just say "let's just stick with what we're doing, it works for now, and we'll just grow the architecture we've got by throwing cubic dollars at it until it works properly."
You're wasting my and the Foundation's money by doing so. Fix the architecture and you reduce the cost of operation.
I mean, have you actually ever designed anything near as complicated as Wikipedia? Have you ever been in a board room with the program manager, project manager, VP, eight developers, and two sysadmins when you all realize at the same time that the architecture you've got just won't scale to the point you need it to?
I've been there. I can draw you two distinct pictures on the whiteboard I mentioned in my previous email. One will be called "horribly fucked" and the other will be called "ideal". We can get from "horribly fucked" to "ideal" (that's what I do for a living), but we have to stop calling each other names and start figuring out what we can do to fix the problem. Test. Benchmark. I mean, start doing what it takes. Right now we're doing nothing. We're putting out fires. Adding more servers is only going to help you until you reach some new unsustainable point. The current architecture is (let me spell it out real big for you)
U N S C A L A B L E
period.
Hoi. So we have a problem. The problem is the architecture. The architecture is being worked on: release 1.5 has an improvement in the database design. The improvements made possible by better software and hardware are used up by the relentless growth of our consumer base. Every bit of capacity is used as it becomes available. This is also what makes it so interesting. Where do we get the money to pay for all the nice toys? (He who dies with the most toys, wins.)
When you compare what is done on the MediaWiki software with what is done in a commercial environment, the first thing that strikes you is the budgets involved. The big achievement of MediaWiki is what it enables on what budget. The price/performance ratio is staggering. Yes, it comes with occasional downtime. We do not have an SLA, we run on beta software, we have volunteers making the impossible a daily occurrence, and it does work on a best-effort basis. Is it not great?
When you consider that some stuff should be deleted in order to get the extra bit of oomph out of our systems, you have to realise that we do not produce a "classic" encyclopedia. What we have with Wikipedia is different; it is all the weird and wonderful stuff we have that makes us different. Some have argued that we do not need all these other Wikipedias because all the knowledge should be included in the one, the English one. I congratulate the en:wikipedia on their 500,000th article. At the same time, the biggest growth is in the other projects. Removing a few articles, the so-called fancruft, will not make a dent in a pack of butter, as the growth is elsewhere and it will continue to grow exponentially for some time to come.
So what will the solution be? Your guess is as good as mine (I have also been in steamy rooms with highflyers). But I strongly believe that a way will be found, as this is one of the most amazing projects to work on. It does change the concepts of how you do business; it is a completely different ecology from the typical commercial world (by the way, a professional is someone who is paid to do a job).
Thanks, GerardM
Alex J. Avriette wrote:
On Thu, 17 Mar 2005 10:04:00 +0000, David Gerard dgerard@gmail.com wrote:
Indeed. keats appears to be starting from a personal distaste and then claiming this will be the destruction of Wikipedia. I note a curious lack of substantiating, ahh, numbers. keats, do you have any?
First off, you can call me "Alex". That is my name.
Sorry. Alex.
Second, I do not have any numbers. I said that I felt that we had some problems. I proposed some solutions, and gave my analysis of said problems.
[...]
You're wasting my and the Foundation's money by doing so. Fix the architecture and you reduce the cost of operation.
It was that you spent five and a half paragraphs (618 words) decrying fancruft and positing that the problem was too many articles, presumably caused primarily by fancruft (I'm presuming that from the 618 words leading up to saying it was overloading the system). This seemed a somewhat novel argument against fancruft, and I recall asking a dev on IRC (I think it was JamesDay, not sure) whether too many articles was in fact a source of our problems, and being told no.
So yes: since it's an anti-fancruft diatribe with blame on the end, I'm going to ask for your numbers. And if you come back and say you have no numbers but instead give seven more vituperative paragraphs, I'll probably assume the sky is not in fact falling in the manner you describe.
Let me put it more simply:
1. Alex, are you *seriously* claiming excess small articles will hasten the downfall of Wikipedia?
[ ] Yes [ ] No
2. If "yes", do you have any numbers?
[ ] Yes, here they are [ ] No
- d.
On Fri, 18 Mar 2005 16:12:32 +0000, David Gerard dgerard@gmail.com wrote:
It was that you spent five and a half paragraphs (618 words) decrying fancruft and positing that the problem was too many articles, presumably caused primarily by fancruft (I'm presuming that from the 618 words leading up to saying it was overloading the system). This seemed a somewhat novel argument against fancruft, and I recall asking a dev on IRC (I think it was JamesDay, not sure) whether too many articles was in fact a source of our problems, and being told no.
I've been accused many times of not being concise. This is perhaps one of those times. The fact, however, that you and others are measuring the responses I am making to these posts in words and kilobytes is shocking to me. What a complete and utter waste of time.
So yes: since it's an anti-fancruft diatribe with blame on the end, I'm
Here's some vituperation for you. Kindly fuck off. This is not an anti-fancruft diatribe.
- Alex, are you *seriously* claiming excess small articles will hasten the
downfall of Wikipedia?
[ ] Yes [X] No
- If "yes", do you have any numbers?
[ ] Yes, here they are [ ] No [X] N/A
First, I am asking questions and making suggestions. If I made any statements that seemed to indicate that I knew that X thing was exploding the wikipedia, either I overstated my point, or somebody misunderstood what I was saying. I think you'll understand if you read on.
Let me explain the reason I bring fancruft up. In fact, since fancruft is obviously such a loaded term, let's just use the word "cruft" to refer to large groups of small related articles.
As a database goes searching through its indices and tables looking for a tuple, it must iteratively go over other tuples before it finds the one it wants. It doesn't just have some magic pointer that knows where [[designated marksman]] is. It has to figure it out. When the number of articles exceeds several hundred thousand, you really need to "give it hints" about where to find that article. I don't know if this is already being done, but it might be possible to use things like categories (or some form of tagging) to "associate" groups of articles so that, while they might not have to live in their own separate table, they would be more easily "findable" by the database.
Do you understand the difference between a sequential scan and an index scan? I mean, I don't know what technical level to start at. Postgres has this feature I like a lot called "explain", so I can say:
wikipedia=# explain select colname from tablename where length(colname) > 5;
and it will explain how the query planner intends to execute that query. This is how one obtains numbers and increases the performance of one's SQL or one's schema (although there's hardly any difference, now, is there?).
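For example, here is roughly what that looks like for a title lookup (a sketch only: the table and column names are invented, not the actual MediaWiki schema, and the costs are illustrative):

    wikipedia=# EXPLAIN SELECT body FROM articles WHERE title = 'Designated marksman';
                             QUERY PLAN
    ----------------------------------------------------------------
     Seq Scan on articles  (cost=0.00..10500.00 rows=1 width=32)
       Filter: (title = 'Designated marksman'::text)

    wikipedia=# CREATE INDEX articles_title_idx ON articles (title);
    wikipedia=# EXPLAIN SELECT body FROM articles WHERE title = 'Designated marksman';
                             QUERY PLAN
    ----------------------------------------------------------------
     Index Scan using articles_title_idx on articles  (cost=0.00..8.27 rows=1 width=32)
       Index Cond: (title = 'Designated marksman'::text)

The first plan reads every row; the second descends a B-tree and touches only a handful of pages.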
Right now, I can tell you that "having more articles" means "Wikipedia will get slower", because there is no way to avoid the fact that the database has to go through the articles it's got to find the one you just asked it for.
Now might be a good time for somebody familiar with the schema to step in and either tell me that I'm FOS, or tell me whether MySQL has some devil-instilled magic that allows it to defy the laws of databases, or something.
Since there seems to be so much hostility coming from you, let me point you in the direction of two small pages which will help:
http://www.petdance.com/perl/geek-culture/
and
http://c2.com/cgi/wiki?SetTheBozoBit
Cheers, aa
On Fri, 18 Mar 2005 12:41:27 -0500, Alex J. Avriette avriette@gmail.com wrote:
As a database goes searching through its indices and tables looking for a tuple, it must iteratively go over other tuples before it finds the one it wants. It doesn't just have some magic pointer that knows where [[designated marksman]] is. It has to figure it out. When the number of articles exceeds several hundred thousand, you really need to "give it hints" about where to find that article. I don't know if this is already being done, but it might be possible to use things like categories (or some form of tagging) to "associate" groups of articles so that, while they might not have to live in their own separate table, they would be more easily "findable" by the database.
Why use categories or tagging for that? And how would it help at all? You have to have the article first before knowing what category or tag it has! A much better method is to use the title itself as the tag, index the 'title' field, and then do a binary search on it.
No need for tagging, categories, or whatever. The only downside is that we are already doing it, so implementing it will not make anything one iota faster. But at least it doesn't slow things down either, which I fear your plans would.
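To make the point concrete (a sketch; the real MediaWiki table and column names differ, this just illustrates the principle): with a B-tree index on the title, a lookup among 500,000 articles costs on the order of log2(500,000), roughly 19 comparisons, rather than a scan of all 500,000 rows.

    CREATE INDEX title_idx ON articles (title);
    -- this point lookup now descends the B-tree (a binary-search-like
    -- walk over ~19 nodes) instead of reading every row:
    SELECT id FROM articles WHERE title = 'Designated marksman';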
Andre Engels
I don't have anything against articles in many different flavors & formats. OK, when I write articles I try to write them in a certain way, so they have coherence. But apart from that, people should do what they want.
To quote Rlandmann from my talk page,
"Pointers for aircraft (including helos) are at Wikipedia:WikiProject Aircraft, where you'll also find links to the category system and current standard layout for these articles. I standardised your CH-21 Shawnee article as an example.
While there are now some quite evolved standards for warships and aircraft, standards for weapon systems more generally are all over the shop. There seems to have been an effort at creating a navigational template for missiles ( {{Missile types}} ), but it hasn't been widely rolled out, and perhaps was really only intended for general articles on categories of missile rather than specific systems anyway. Similarly, while a standard spec table for small arms was developed, it hasn't been implemented widely (and WikiProject Weaponry is now listed as inactive anyway).
There have been at least two attempts to develop a broad-based and consistent way to categorise weapons, but neither went very far.
Given the mess, at the moment I really only try to make sure that weapons articles are reasonably well-classified, and that aircraft weapon systems carry the {{airlistbox}} template. I also list new aircraft articles here and standardise them when I get the time (currently bogged down in mid-February).
If you want to have a stab at it, some kind of consistency is desperately needed for weapons on Wikipedia. At the moment, it's pretty much a model of the worst aspects of collaborative editing, so it'll be a long-term project!"
So I am clearly not alone in believing that content standardization is a good thing. I think there are two very strong arguments for content standardization (across categories, not across the entire site). First is appearance: that of a cohesive and organized project. Second is machine parseability. As I mentioned in my original message, I have written many content scrapers, and it is exceptionally difficult to do for data which is so irregular. In an earlier thread on this very list, earlier this week (sorry, I've since trashed it), somebody said something like (I paraphrase here): "I am not aware that a parser for MediaWiki exists." Wiki markup is notoriously hard to parse. Once you pick a wiki, you're stuck with it, because the data becomes so free-form and disorganized that a move (say, from Instiki to MediaWiki) becomes an exercise in the impossible.
To request that we use the:
{|
| foo metric || bar value
|-
| baz metric || quux value
|}
notation is not asking much. Or indeed, to standardize, as Rlandmann did with my [[CH-21 Shawnee]] article per the guidelines at http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aircraft/page_content. But when every article you come across is different, it is difficult to obtain the information you have /come to wikipedia to find/. The Wiktionary (with which I have only a passing familiarity, though I do contribute) is very clear in its "guidelines" that there is to be consistency, and you /shall/ use templates, rather than "It would be nice, but don't feel obligated to do so."
The wikipedia is not an exercise in anarchy.
I'm not sure we want to standardize everything, because it means people have to learn that format, and know where to find it in the first place. You have to enforce it. It makes for wars as new users come in, or people don't like the format and say "hey, this format should be changed like that", and people reply "per the decision of two weeks earlier, no, we keep that", and so on and so on.
I'm real happy to go through and standardize articles. In fact, when I found an article that looked like it needed help, as I did with [[USS Bowfin (SS-287)]], I came on IRC to #wikipedia and got help from admins who knew the {| .. |} wiki table format. They explained it to me, and I went around looking at what everything else looked like, and made sure that "my page" looked just like the rest of them. It presently looks like it needs a little help (that image/text placement irks me). Sigh.
- Fancruft and how to cope
Sure. When the wiki is slow, we should just cut off en:. After all, that's fancruft, and it shouldn't exist. Or we should just concatenate all pages into one big page; then there wouldn't need to be any page-existence checking.
You know, I worked really hard to write a concise message to the list, to address a lot of concerns that I had. I offered myself wholly for help. I expressed a willingness to work with the community which seems to support "fancruft networks." I've received a lot of shit about this on IRC and here on this list. I really don't appreciate it, and I don't see how people can be such biased assholes and say things like "just turn off en." Even as a joke. We are having weekly outages. At least. THERE IS A PROBLEM. It will require work to fix, and people are going to get their toes stepped on. Maybe they're yours, maybe they're mine, but let's not take potshots at each other.
More seriously: saying that merging articles could (help) solve slowness is a bad social solution to a technical problem. People want to write articles, as many as they want.
I said I didn't know what the exact benefit or detriment was to merging many small articles (such as in the case of MMPR). I said that I'd like to know. A couple of people have commented that they'd like to try the Squids on FreeBSD (which has a fancier network stack than Linux, although 2.6 seems to be about even. But I digress..). People say that they'd like to beta test with Postgres. What I am saying is that there are problems, and we are propping up the entire operation with constant maintenance, but no forward progress is being made. Nobody is testing whether Wikipedia is faster with many condensed pages (frankly, how to conduct that test is a mystery to me) or with lots of smaller pages. Maybe that's something that could be asked of the MySQL developers.
What would be next? "Sorry, you made more than 5 modifications in the last 15 minutes, please wait 15 minutes so that everyone gets a chance to edit"?
Nobody is saying throttle users. Nobody is saying limit the number of pages one can create in a given category. There's no need to scream that the sky is falling, that throttling and edit limits are on the horizon because some fascist asshole is going to impose them on all the doe-eyed Wikipedians. They aren't. We need to work together to find a solution.
Site is slow? Make the software faster, buy new hardware, optimize, imagine a Beowulf cluster of servers on the moon, whatever - do *not* try to restrict the freedom people have.
I AM NOT ADVOCATING THE RESTRICTION OF ANYONE'S FREEDOM, for the last time.
However, you just fell into the trap that most software development projects fall into.
"Oh, our software runs slow. Well, buy faster hardware." "Oh, our software runs slow. Well, make the software faster."
When you find that you have flaws in your ARCHITECTURE (and I pointed out at least two), you need to fix them from a high level. I wish we could all sit in a room and draw this out on a whiteboard. You, for one, would be a lot less hostile.
Sorry if my tone sounds rash. I'm totally against your idea, maybe because I do write "fancruft" articles, but also because it's IMO totally against the founding ideas of Wikipedia - this doesn't mean I despise you :)
You're totally against WHAT idea? This is why I said this should be discussed in wiki format, on the wiki. Or did you not read that far?
alex
Can we please come back to applying the KISS principle: keep it short and simple? Thanks