Re: Chat about Wikipedia performance?

List overview All Threads
Download

newer

older

Experimental page caching in cvs,...

Mailing lists on ibiblio?

David A. Wheeler

3 May 2003 3 May '03

1:32 a.m.

I've looked at Brion Vibber's "ps auxwww" output (thanks!!). Although the MySQL daemons take up the lion's share of memory, they don't take much of the %CPU, even in aggregate. Instead, the CPU seems to be taken up by the Apache daemons (/usr/local/apache/bin/httpd). It doesn't appear that one daemon takes up all the time; it appears spread out to some extent (a little bit by each, though there IS a lot of variance). Presumably this is due to each one executing the PHP scripts.

Clearly speeding execution of the PHP scripts would help. One way is to reduce the work they have to do (e.g., caching the HTML). Another is coding the hotspot (e.g., as a loaded C module). But doing it right requires identifying what the hotspot is in the PHP scripts.

Is there a way to enable performance monitoring in PHP, like gprof in C, to figure out where the hotspots in the PHP scripts are? Failing that, I guess you could insert monitoring points in various places (painful, painful).

Of course, this doesn't mean that moving wikitext from MySQL to the filesystem, or using the filesystem as an HTML cache, is a bad idea. I don't know how transmitting data from MySQL to the scripts is accounted for; the transit time betwen script and MySQL may be hidden in the script performance measures.

Show replies by date

Nick Reinking

3 May 3 May

2:21 a.m.

New subject: Chat about Wikipedia performance?

On Fri, May 02, 2003 at 01:02:13PM -0700, David A. Wheeler wrote:

...

Clearly speeding execution of the PHP scripts would help. One way is to reduce the work they have to do (e.g., caching the HTML). Another is coding the hotspot (e.g., as a loaded C module). But doing it right requires identifying what the hotspot is in the PHP scripts.

I'm actually in the middle of a C project to reduce the wikitext parser to a two-pass parser. It should reduce the complexity of the wikitext down to a point where the only thing the PHP code will have to do is:

* Handle links / link lookup * Ignore links in <nowiki> (everything else is done) * Handle <math> conversion * ~~~ and ~~~~ * ISBN lookups

Some of that could possibly be moved in there in the future; probably everything but the link lookups and link ignoring.

Everything else should be down by the C module underneath. This should be a significant speed-up. Now, I'm only about a third of the way done with the code, but my lexical analyzer is pretty speedy thus far (about 25000 lines/sec). It currently handles:

---- == === ==== \n\n '' ''' ''''' (better than the current code, which has a problem handling ''''')

I still need to do: * Lists * Manual formatting * <nowiki> conversion

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

3:20 a.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

I'm actually in the middle of a C project to reduce the wikitext parser to a two-pass parser...

Don't stop working on it, but be aware that when our performance issues are lessened for a while, I plan to go back to some work I had been planning for a long time to simplify and formalize wikitext syntax while adding some powerful features to it, so make sure your code is open and modular enough that it can be changed.

The first plan was to unify link syntax; rather than having [http://xxx zzz] for external links and [[xxx|zzz]] for internals, I want to unify on [[xxx|zzz]] for both, and treat "http", "ftp", etc. as special namespaces that makes external links.

The ''''' thing has /never/ worked correctly, and still doesn't. Part of the problem is that there's no precise, formal definition of what it /should/ do, and no obvious "best" choice. I want to nail that down soon.

I really want to implement styles with the {{class}} syntax I detailed earlier. That will let HTML gurus produce effects that will be much easier for novices to edit than the current mess.

Finally, I want to do something about a wikitext syntax for tables, probably using || characters.

-- Lee Daniel Crocker lee@piclab.com http://www.piclab.com/lee/ "All inventions or works of authorship original to me, herein and past, are placed irrevocably in the public domain, and may be used or modified for any purpose, without permission, attribution, or notification."--LDC

Nick Reinking

8 May 8 May

2:37 a.m.

New subject: Chat about Wikipedia performance?

...

...
(Nick Reinking nick@twoevils.org):

I'm actually in the middle of a C project to reduce the wikitext parser to a two-pass parser...

Just to update everybody on my progress with the C wikitext parser:

To do: * Lists of any sort

Done: * Ignores <math> * Converts < > and & inside <nowiki> * <pre> (space at beginning of line) * <hr> (---- at beginning of line) * Sections, subsections, and subsubsections (==, ===, and ==== respectively) * Emphasis, strong emphasis, and very strong emphasis ('', ''', and ''''') * {{CURRENTMONTH}}, {{CURRENTDAY}}, {{CURRENTYEAR}}, {{CURRENTTIME}} * Basic links (http://, ftp://, gopher://, news://, etc.) * Complex basic links ([http://... Blah Blah]

Possibly later: * ISBN lookups * Handle <math> conversion

Must be done by PHP: * Handle links / link lookup * Ignore links in <nowiki> * ~~~ and ~~~~ * {{NUMBEROFARTICLES}}, {{CURRENTMONTHNAME}}, {{CURRENTDAYNAME}}

Couple quick questions: When Wikitext is pulled from the database, what are the newlines? Are they always \n? If so, I can clean up the parsing a bit and eek a bit more performance out (not a big deal). Also, what format is the wikitext stored in the database as? UTF-8? UTF-16?

As far as performance goes, with what I'm handling now, with all the .txt data files in the testsuite (x256 = 492672 lines), I'm seeing parsing speeds of about 86600 lines/sec (in an 18KB executable).

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Tomasz Wegrzanowski

3:16 a.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 04:07:00PM -0500, Nick Reinking wrote:

...

Also, what format is the wikitext stored in the database as? UTF-8? UTF-16?

Most in UTF-8. Some in ISO-8859-1 but they should move to UTF-8 at some point.

...

As far as performance goes, with what I'm handling now, with all the .txt data files in the testsuite (x256 = 492672 lines), I'm seeing parsing speeds of about 86600 lines/sec (in an 18KB executable).

On what hardware ? What test suite ? kB/sec would be more useful estimate than line/sec.

Nick Reinking

7:33 a.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 11:46:59PM +0200, Tomasz Wegrzanowski wrote:

...

On Wed, May 07, 2003 at 04:07:00PM -0500, Nick Reinking wrote:

...
Also, what format is the wikitext stored in the database as? UTF-8? UTF-16?

Most in UTF-8. Some in ISO-8859-1 but they should move to UTF-8 at some point.

...
As far as performance goes, with what I'm handling now, with all the .txt data files in the testsuite (x256 = 492672 lines), I'm seeing parsing speeds of about 86600 lines/sec (in an 18KB executable).

On what hardware ? What test suite ? kB/sec would be more useful estimate than line/sec.

Hardware: P4 1.8Ghz 512MB RAM The files are from CVS under phase3/testsuite/data/*.txt

The file is 28650KB, so 5036KB/sec.

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

3:18 a.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

Couple quick questions: When Wikitext is pulled from the database, what are the newlines?

MySQL gives back whatever you give it. We generally give it Unix-style text with just \n, but a few browsers might add CRs.

...

Are they always \n? If so, I can clean up the parsing a bit and eke a bit more performance out (not a big deal).

It shouldn't hurt performance to just ignore and skip CRs. That can be done in the lexer. You should never encounter CR-only line ends.

...

Also, what format is the wikitext stored in the database as? UTF-8? UTF-16?

Some of the foreign ones use UTF-8. The English one is ISO-8859-1.

...

As far as performance goes, with what I'm handling now, with all the .txt data files in the testsuite (x256 = 492672 lines), I'm seeing parsing speeds of about 86600 lines/sec (in an 18KB executable).

So on a typical page of, say, 40-50 lines, that makes half a millisecond spent in parsing. If PHP were 100 times worse, it would account for 1/20th of a second per page fetch. Doesn't sound like much of a problem to me, and I doubt it's 1000 times worse.

Just curious: what does your parser do with Quotes.txt from the test suite?

Nick Reinking

7:42 a.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 04:48:03PM -0500, Lee Daniel Crocker wrote:

...

So on a typical page of, say, 40-50 lines, that makes half a millisecond spent in parsing. If PHP were 100 times worse, it would account for 1/20th of a second per page fetch. Doesn't sound like much of a problem to me, and I doubt it's 1000 times worse.

Just curious: what does your parser do with Quotes.txt from the test suite?

Well, I suspect it is about 100 times (or more slower). I don't understand the architecture of the parser perfectly, but a similiar project using a lexical parser (as opposed to progressive pattern matches) was about 200 time slower. At the very least, it'll make the PHP code considerably cleaner.

As far as Quotes.txt goes, ignoring the s, we get:

Wikipedia quoting tests: (1) normal bold normal (2) normal italic normal (3) normal bold italic normal (4) normal bold bold italic bold normal (5) normal italic bold italic italic normal (6) normal bold italic bold normal (7) normal bold italic italic normal (8) normal italic bold italic normal (9) normal bold bold italic normal (10) normal bold's normal (11) normal italic's normal (12) normal italic's bold's italic italic's normal (13) normal bold's italic bold's normal (14) normal italic normal (15) normal 'bold normal (16) normal italic normal italic normal (17) normal italic normal bold normal (18) normal bold normal bold normal (19) normal bold normal italic normal

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

8:10 a.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

As far as Quotes.txt goes, ignoring the s, we get: [...]

Well that's broken in very different ways from the current code and older versions. For one thing, \n should close all open phrase-level markup; i.e., '' and ''' should not span lines.

On the other hand, ... IS valid and legal (if somewhat redundant) HTML.

Nick Reinking

9:51 a.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 09:40:22PM -0500, Lee Daniel Crocker wrote:

...

...
(Nick Reinking nick@twoevils.org):

As far as Quotes.txt goes, ignoring the s, we get: [...]

Well that's broken in very different ways from the current code and older versions. For one thing, \n should close all open phrase-level markup; i.e., '' and ''' should not span lines.

On the other hand, ... IS valid and legal (if somewhat redundant) HTML.

As far as the \n closing, I can change it to behave like that, although it seems clearer to me if it continues to span lines (ala HTML). It doesn't bother me either way (other than a bit more coding). Do a lot of people not close their ''/'''/'''''s?

This is how it works, based on what seems to be represented on http://www.wikipedia.org/wiki/Wikipedia%3AHow_to_edit_a_page

'' ... '' <---- ... ''' ... ''' <---- ... ''''' ... ''''' <---- ...

That is, it considers ''''' to be entirely different than ''' followed by '' (which is how the current code parses it, but not how the HOWTO page seems to explain it).

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

10:26 a.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

As far as the \n closing, I can change it to behave like that, although it seems clearer to me if it continues to span lines (ala HTML). It doesn't bother me either way (other than a bit more coding). Do a lot of people not close their ''/'''/'''''s?

Wiki syntax is a line-based syntax. There is /no/ wiki markup that spans lines. It makes editing much simpler: if you make a mistake and forget to close something, it gets closed off quickly. HTML is not designed to be human-editable; wiki syntax is.

...

This is how it works, based on what seems to be represented on http://www.wikipedia.org/wiki/Wikipedia%3AHow_to_edit_a_page

'' ... '' <---- ... ''' ... ''' <---- ... ''''' ... ''''' <---- ...

Sure, those are the easy cases. The code version before this one did those right, but screwed up on other cases (like ''a'''b'''c''). Like I said, half the battle here will be defining exactly what /should/ be done in all cases.

John R. Owens

11:37 a.m.

New subject: Chat about Wikipedia performance?

On Wed, 7 May 2003, Lee Daniel Crocker wrote:

...

Date: Wed, 7 May 2003 23:56:54 -0500 From: Lee Daniel Crocker lee@piclab.com Subject: Re: [Wikitech-l] Re: Chat about Wikipedia performance?

...
(Nick Reinking nick@twoevils.org):

As far as the \n closing, I can change it to behave like that, although it seems clearer to me if it continues to span lines (ala HTML). It doesn't bother me either way (other than a bit more coding). Do a lot of people not close their ''/'''/'''''s?

Wiki syntax is a line-based syntax. There is /no/ wiki markup that spans lines. It makes editing much simpler: if you make a mistake and forget to close something, it gets closed off quickly. HTML is not designed to be human-editable; wiki syntax is.

Not that that's not enough, but of course, lines (that don't continue on the next one without a *, :, etc.) are also enclosed by (unclosed! grr) tags, and you can't put a in the middle of a and close it after the next and call it good HTML.

-- John R. Owens http://www.ghiapet.homeip.net/ Life's full of mysteries. Consider this one of them. --Commander Jeffrey Sinclair

Nick Reinking

7:28 p.m.

New subject: Chat about Wikipedia performance?

On Thu, May 08, 2003 at 01:07:19AM -0500, John R. Owens wrote:

...

...
...
As far as the \n closing, I can change it to behave like that, although it seems clearer to me if it continues to span lines (ala HTML). It doesn't bother me either way (other than a bit more coding). Do a lot of people not close their ''/'''/'''''s?

Wiki syntax is a line-based syntax. There is /no/ wiki markup that spans lines. It makes editing much simpler: if you make a mistake and forget to close something, it gets closed off quickly. HTML is not designed to be human-editable; wiki syntax is.

Not that that's not enough, but of course, lines (that don't continue on the next one without a *, :, etc.) are also enclosed by (unclosed! grr)

 tags, and you can't put a in the middle of a and close it after the next and call it good HTML.

A couple things... I guess I don't see it as being too difficult or too complicated for users to understand that you need to enclose text you want to place emphasis on with '''. I know that stopping at newlines prevents an entire article from being emphasized, but it also prevents users from having large sections of emphasized text (without putting emphasis marks on every line). The Howto page certainly makes it look like you have to enclose your text, and in all the pages I've edited, I've never come across a page that leaves open emphasis marks.

Also, there are lots of line spanning constructs in wikitext. <pre>, <nowiki>, <tr>, <td>, etc. Now, I _know_ that (with the exception of <nowiki>) that these are HTML constructs (and not Wikitext constructs), but to the average user, they are exactly the same. So, some things span multiple lines, and some things don't. I think that is confusing.

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Nick Reinking

7:49 p.m.

New subject: Chat about Wikipedia performance?

...

...
...
Wiki syntax is a line-based syntax. There is /no/ wiki markup that spans lines. It makes editing much simpler: if you make a mistake and forget to close something, it gets closed off quickly. HTML is not designed to be human-editable; wiki syntax is.

Not that that's not enough, but of course, lines (that don't continue on the next one without a *, :, etc.) are also enclosed by (unclosed! grr)

 tags, and you can't put a in the middle of a and close it after the next and call it good HTML.

A couple things... I guess I don't see it as being too difficult or too complicated for users to understand that you need to enclose text you want to place emphasis on with '''. I know that stopping at newlines prevents an entire article from being emphasized, but it also prevents users from having large sections of emphasized text (without putting emphasis marks on every line). The Howto page certainly makes it look like you have to enclose your text, and in all the pages I've edited, I've never come across a page that leaves open emphasis marks.

Also, there are lots of line spanning constructs in wikitext. <pre>, <nowiki>, <tr>, <td>, etc. Now, I _know_ that (with the exception of <nowiki>) that these are HTML constructs (and not Wikitext constructs), but to the average user, they are exactly the same. So, some things span multiple lines, and some things don't. I think that is confusing.

I guess I'd like to clarify one thing. I don't want to sound pushy, or "not a team player", or somebody who is just jumping in and disrupting the good work that everybody else is doing. And you all are doing a great job, BTW. ;)

Anyways, what I want to clarify. To me, when my mind encounters '' or ''', it thinks, "Ooo! Quotation marks that make stuff bold!". My mind is used to closing quotation marks, so I guess that's why it makes the most sense for them to span multiple lines until they are closed. I'm not sure how quotes work in a lot of other languages, but I know they're closed in Japanese much the same way (but with different symbols than quotation marks).

That's not to say that stopping them at the end of a line is a bad idea - I'm sure it helps a lot to prevent new users from making a bad mistake, seeing a messed up page, and giving up.

In the end, I'm writing a new parser for Wikipedia, not for myself. If everybody thinks it should end at newlines, I can make it do that, and that will be that. :)

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

11:38 p.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

Anyways, what I want to clarify. To me, when my mind encounters '' or ''', it thinks, "Ooo! Quotation marks that make stuff bold!". My mind is used to closing quotation marks, so I guess that's why it makes the most sense for them to span multiple lines until they are closed. I'm not sure how quotes work in a lot of other languages, but I know they're closed in Japanese much the same way (but with different symbols than quotation marks).

That's not to say that stopping them at the end of a line is a bad idea

I'm sure it helps a lot to prevent new users from making a bad

mistake, seeing a messed up page, and giving up.

In the end, I'm writing a new parser for Wikipedia, not for myself. If everybody thinks it should end at newlines, I can make it do that, and that will be that. :)

There are other reasons to kill them at line-ends. Primarily, the block-level elements like lists and s and <pre>s are defined by the first character of each line; allowing '' to span lines would require that we close and re-open them at paragraph boundaries to stay valid HTML, and that's complicated and error-prone. Second, just /defining/ proper behavior requires specifying some maximum scope; otherwise, things like '' a ''' b '' c ''' d '' ... will just stack up without closing. If we define them to close at line-end (and I'd further define them to stack at most two levels), then they're easier to cleanly specify.

Nick Reinking

9 May 9 May

2:20 a.m.

New subject: Chat about Wikipedia performance?

On Thu, May 08, 2003 at 01:08:58PM -0500, Lee Daniel Crocker wrote:

...

...
(Nick Reinking nick@twoevils.org):

Anyways, what I want to clarify. To me, when my mind encounters '' or ''', it thinks, "Ooo! Quotation marks that make stuff bold!". My mind is used to closing quotation marks, so I guess that's why it makes the most sense for them to span multiple lines until they are closed. I'm not sure how quotes work in a lot of other languages, but I know they're closed in Japanese much the same way (but with different symbols than quotation marks).

That's not to say that stopping them at the end of a line is a bad idea

I'm sure it helps a lot to prevent new users from making a bad

mistake, seeing a messed up page, and giving up.

In the end, I'm writing a new parser for Wikipedia, not for myself. If everybody thinks it should end at newlines, I can make it do that, and that will be that. :)

There are other reasons to kill them at line-ends. Primarily, the block-level elements like lists and s and <pre>s are defined by the first character of each line; allowing '' to span lines would require that we close and re-open them at paragraph boundaries to stay valid HTML, and that's complicated and error-prone. Second, just /defining/ proper behavior requires specifying some maximum scope; otherwise, things like '' a ''' b '' c ''' d '' ... will just stack up without closing. If we define them to close at line-end (and I'd further define them to stack at most two levels), then they're easier to cleanly specify.

So, should I go ahead implementing the C parser to handle the current Wikitext, or should I wait until we have an actual specification?

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

2:27 a.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

So, should I go ahead implementing the C parser to handle the current Wikitext, or should I wait until we have an actual specification?

I'm generally in the code-first ask-questions-later school, but since the act of coding brings up questions, we might as well try to deal with them when they come up.

Brion Vibber

8 May 8 May

10:07 p.m.

New subject: Chat about Wikipedia performance?

On Thu, 2003-05-08 at 06:58, Nick Reinking wrote:

...

The Howto page certainly makes it look like you have to enclose your text, and in all the pages I've edited, I've never come across a page that leaves open emphasis marks.

That's right. If you don't have both close and open marks, they're left as literal '' and ''' sequences.

...

Also, there are lots of line spanning constructs in wikitext. <pre>, <nowiki>, <tr>, <td>, etc. Now, I _know_ that (with the exception of <nowiki>) that these are HTML constructs (and not Wikitext constructs), but to the average user, they are exactly the same. So, some things span multiple lines, and some things don't. I think that is confusing.

Yes, that's why we should destroy the pseudo-HTML and make things consistent and happy. :)

-- brion vibber (brion @ pobox.com)

Lee Daniel Crocker

11:31 p.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

Also, there are lots of line spanning constructs in wikitext. <pre>, <nowiki>, <tr>, <td>, etc. Now, I _know_ that (with the exception of <nowiki>) that these are HTML constructs (and not Wikitext constructs), but to the average user, they are exactly the same. So, some things span multiple lines, and some things don't. I think that is confusing.

HTML things span lines; most of those we can eventually eliminate. We'll probably always be stuck with <nowiki>, but all of the others you mention above are totally unnecessary. And there will be a much better way to emphasize whole paragraphs using style elements.

Wiki syntax is an evolving language, but I'm determined to make it a clean, well-specified, useful, powerful, and consistent one instead of the current hodgepodge.

Nick Reinking

16 May 16 May

10:41 p.m.

New subject: Chat about Wikipedia performance?

...

...
...
Wiki syntax is a line-based syntax. There is /no/ wiki markup that spans lines. It makes editing much simpler: if you make a mistake and forget to close something, it gets closed off quickly. HTML is not designed to be human-editable; wiki syntax is.

I'm having a bit of trouble implementing the C parser because the Wikitext parser has a lot of quirks. For example, you say that no wiki markup spans lines, but if you take a look at: http://www.wikipedia.org/wiki/User:Marumari/Wikitext_Rendering_Quirks you can see that headers do span lines.

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Lee Daniel Crocker

11:34 p.m.

New subject: Chat about Wikipedia performance?

...

(Nick Reinking nick@twoevils.org):

...
...
...
Wiki syntax is a line-based syntax. There is /no/ wiki markup that spans lines. It makes editing much simpler: if you make a mistake and forget to close something, it gets closed off quickly. HTML is not designed to be human-editable; wiki syntax is.

I'm having a bit of trouble implementing the C parser because the Wikitext parser has a lot of quirks. For example, you say that no wiki markup spans lines, but if you take a look at: http://www.wikipedia.org/wiki/User:Marumari/Wikitext_Rendering_Quirks you can see that headers do span lines.

It's not easy, I agree. I wasn't aware that headers could span lines; I'm not sure whether or not I like that. I don't think so offhand. You'll also notice that I changed my mind about quotes--I think it's probably true that users will be somewhat surprized by forcing quotes to close on every line break, so in my long-range vision I closed them only at paragraph end.

My suggestion to an implementor is this: if the current code has quirks, it's largely because the expected behavior is undefined, so don't be afraid to define it yourself--if your definition makes sense, and doesn't screw up too many existing pages, it will likely be adopted.

Hunter Peress

8 May 8 May

8:46 a.m.

New subject: Chat about Wikipedia performance?

nick, you did not reply to my request that you perform php tests of the same materials on your computer as it the control environment.

__________________________________ Do you Yahoo!? The New Yahoo! Search - Faster. Easier. Bingo. http://search.yahoo.com

Nick Reinking

9:40 a.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 08:16:24PM -0700, Hunter Peress wrote:

...

nick, you did not reply to my request that you perform php tests of the same materials on your computer as it the control environment.

I did, but I'm afraid I don't understand the current code well enough to do that. If somebody wants to send me a PHP fragment that can open up a file and do everything but link checking, I'd be happy to run control tests.

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

tarquin

4:03 a.m.

New subject: Chat about Wikipedia performance?

Nick Reinking wrote:

...

...
...
I'm actually in the middle of a C project to reduce the wikitext parser to a two-pass parser...

sorry to drop in late in the conversation... I'd not been following this one.

I mentioned a while ago that on Unreal Wiki we completely replaced UseMod's wikitext parser. It's a Perl module that's OO and very easy to extend.

I'm sure the guy who wrote it, Mych, would let Wikipedia use it if interested.

See http://wiki.beyondunreal.com/wiki/Wookee

Brion Vibber

4:37 a.m.

New subject: Chat about Wikipedia performance?

On Wed, 7 May 2003, Nick Reinking wrote:

...

Just to update everybody on my progress with the C wikitext parser:

To do:

Lists of any sort

*shudder* :)

...

Sections, subsections, and subsubsections (==, ===, and ==== respectively)

Should work from = to ====== (h1 to h6).

...

Emphasis, strong emphasis, and very strong emphasis ('', ''', and ''''')

Make sure the following cases work (and produce correct HTML, unlike our current code): ''italic '''bold-italic''' italic'' '''''bold-italic''' italic'' ''italic '''bold-italic'''''

'''bold ''bold-italic'' bold''' '''''bold-italic'' bold''' '''bold ''bold-italic'''''

...

Must be done by PHP:

Handle links / link lookup

Ignore links in <nowiki>

<nowiki> and <math> sections should probably be pulled out _before_ parsing, and their contents processed and reinserted after parsing.

...

When Wikitext is pulled from the database, what are the newlines? Are they always \n?

They sure should be...

...

Also, what format is the wikitext stored in the database as? UTF-8? UTF-16?

At the moment, ISO-8859-1 for the following languages: English, Danish, German, French, Dutch, Spanish, Swedish

UTF-8 for everything else that's on phase 3. The remaining latin-1s will get bumped up to UTF-8 at some point, once someone gets around to ensuring that it won't break with browsers that are violently unfriendly to editing UTF-8 text in forms.

-- brion vibber (brion @ pobox.com)

Hunter Peress

5:19 a.m.

New subject: Chat about Wikipedia performance?

Nick: as Lee discusses but does not support, could you also produce php parsing results from you machine (to keep things in a control).

tarquin: perl would not be any faster than php..

__________________________________ Do you Yahoo!? The New Yahoo! Search - Faster. Easier. Bingo. http://search.yahoo.com

Nick Reinking

7:27 a.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 04:07:05PM -0700, Brion Vibber wrote:

...

On Wed, 7 May 2003, Nick Reinking wrote:

...
Just to update everybody on my progress with the C wikitext parser:

To do:

Lists of any sort

*shudder* :)

Yep, that's why I've put it off for last.

...

...

Sections, subsections, and subsubsections (==, ===, and ==== respectively)

Should work from = to ====== (h1 to h6).

Okay, I didn't know that. Easy enough to add, though. :)

...

...

Emphasis, strong emphasis, and very strong emphasis ('', ''', and ''''')

Make sure the following cases work (and produce correct HTML, unlike our current code): ''italic '''bold-italic''' italic'' '''''bold-italic''' italic'' ''italic '''bold-italic'''''

'''bold ''bold-italic'' bold''' '''''bold-italic'' bold''' '''bold ''bold-italic'''''

That produces:

italic bold-italic italic bold-italic italic italic bold-italic

bold bold-italic bold bold-italic bold bold bold-italic

...

...
Must be done by PHP:

Handle links / link lookup

Ignore links in <nowiki>

<nowiki> and <math> sections should probably be pulled out _before_ parsing, and their contents processed and reinserted after parsing.

Don't worry, it is very easy and cheap for me to ignore them, and do the converstion inside of <nowiki>.

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

Toby Bartels

10:01 a.m.

New subject: Chat about Wikipedia performance?

Nick Reinking wrote:

...

Converts < > and & inside <nowiki>

& shouldn't be treated specially inside <nowiki>. Character entities aren't wiki markup; they're bona fide HTML character entities. (And if you don't buy that argument -- it will change the behaviour of several pages, not least [[en:Wikipedia:How to edit a page]].)

-- Toby

Nick Reinking

7:29 p.m.

New subject: Chat about Wikipedia performance?

On Wed, May 07, 2003 at 09:31:29PM -0700, Toby Bartels wrote:

...

Nick Reinking wrote:

...

Converts < > and & inside <nowiki>

Whoops, sorry. I misinterpreted what I saw on the howto page. It's been fixed. :)

-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN

7908

Age (days ago)

7922

Last active (days ago)

wikitech-l@lists.wikimedia.org

28 comments

10 participants

tags (0)

participants (10)

Brion Vibber
Brion Vibber
David A. Wheeler
Hunter Peress
John R. Owens
Lee Daniel Crocker
Nick Reinking
tarquin
Toby Bartels
Tomasz Wegrzanowski