Warning: Yet Another Crazy Idea of Mine ahead. If you're sick of these (by bitter experience ;-) delete this mail *now*.
Still here? Great!
OK, we all know that the current parser, while working, is not the final word. It is kinda slow due to multiple passes, the source is confusing, and there are some persistent bugs in it, like the template malfunctions.
I therefore suggest a new structure:
1. Preprocessor
2. Wiki markup to XML
3. XML to (X)HTML
Let me go through these. The preprocessor would basically do the template and variable stuff; dumb text replacement, like a C/C++ preprocessor. It would generate the complete wiki-markup text, which is then carefully chewed by the to-XML-converter. The XML output is then converted into HTML/XHTML for display.
Why have this XML step in there? Couple of reasons.
I am thinking of XML that can be generated from the wiki text /without any knowledge of the rest of the database/. The converter will not check "does this article exist?" or "is this an internal link, an image, an interwiki link, or what?". It should only convert "[[this|or that]]" to "<wikilink page='this'>or that</wikilink>". Also, it should produce only valid XML, no matter the input.
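To make that concrete, here is a minimal PHP sketch of such a database-blind conversion, handling only the simplest, non-nested case of [[page]] / [[page|text]]; it is an illustration of the idea, not the actual converter, and it escapes everything so the output stays well-formed no matter the input:

  <?php
  // Illustration only: convert non-nested [[page|text]] / [[page]] into
  // <wikilink> elements, without consulting the database at all.
  function wikiLinksToXml( $text ) {
      return preg_replace_callback(
          '/\[\[([^|\]]+)(?:\|([^\]]*))?\]\]/',
          function ( $m ) {
              $page  = htmlspecialchars( $m[1], ENT_QUOTES );
              $label = htmlspecialchars( isset( $m[2] ) ? $m[2] : $m[1], ENT_QUOTES );
              return "<wikilink page='$page'>$label</wikilink>";
          },
          $text
      );
  }

  echo wikiLinksToXml( 'See [[this|or that]] and [[Main Page]].' );
  // See <wikilink page='this'>or that</wikilink> and <wikilink page='Main Page'>Main Page</wikilink>.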
IMHO that would separate the actual *parsing* of the wiki markup from its *meaning*. There are useful methods, functions, libraries and-what-not for dealing with XML.
A few points working in favor of this idea:
* The wiki parser (#2) can be clean, brief and efficient without worrying about the context of the page; it can focus on parsing wiki markup.
* The XML-HTML-converter (#3) can focus on the pure context of the page: make a normal link, stub link, thumbnailed image, or a category or interwiki link, etc.
* Both can be developed and maintained independently. To make thumbnail display behave differently, I won't have to look at the dirty wiki markup parsing at all ;-)
* We could have a decent XML output function at no cost.
* Other wikis, used to a different syntax, could easily switch to MediaWiki by adapting the wiki-to-XML module.
For wiki-to-XML, we could even use an external parser. I am currently toying around with one in C++ (yes, another one, and yes, one could probably write it in two lines of Perl. So, go ahead!;-)
I *do* realize that this would mean a great change to our current software. Therefore, I am not demanding that this be implemented in 1.3.1 :-) But, in the long run, I strongly believe that this is the way to go. The performance loss from doing two steps instead of one might even be compensated for by the increased performance of each specialized part. The value of making the parser more modular, however, should IMHO not be underestimated.
Magnus
P.S.: Hurricane "Charley" - Britannica's last hope... ;-)
Magnus Manske wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
This doesn't actually solve any of the issues with the current parser, since it merely has it produce a different output format.
The main problems are that we have a mess of regexps that stomp on each other all the time.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Magnus Manske wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
This doesn't actually solve any of the issues with the current parser, since it merely has it produce a different output format.
The main problems are that we have a mess of regexps that stomp on each other all the time.
-- brion vibber (brion @ pobox.com)
Can't we switch back to the tokenizer parser and try to optimize it? The token approach seems much easier to maintain.
On Fri, Aug 13, 2004 at 09:38:34PM +0200, Ashar Voultoiz wrote:
Brion Vibber wrote:
Magnus Manske wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
This doesn't actually solve any of the issues with the current parser, since it merely has it produce a different output format.
The main problems are that we have a mess of regexps that stomp on each other all the time.
-- brion vibber (brion @ pobox.com)
Can't we switch back to the tokenizer parser and try to optimize it? The token approach seems much easier to maintain.
Character-by-character string parsing in PHP is slow since there is too much overhead. Tokenizing probably has to be done in a C(++) function.
Another point where the tokenizer was slow was the byte-by-byte composition of the result string. I've been told that adding small strings to an array and joining them at the end is much faster; worth a try.
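For reference, the pattern being described looks roughly like this in PHP (a sketch only; whether it actually wins depends on the PHP version and the workload):

  <?php
  // Collect the small strings in an array and join them once at the end,
  // instead of growing one result string with repeated .= appends.
  $words  = array( 'a', 'list', 'of', 'small', 'strings' );
  $pieces = array();
  foreach ( $words as $w ) {
      $pieces[] = '<token>' . $w . '</token>';
  }
  $out = implode( '', $pieces );
  echo $out;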
JeLuF
Jens Frank wrote:
Another point where the tokenizer was slow was the byte-by-byte composition of the result string. I've been told that adding small strings to an array and joining them at the end is much faster; worth a try.
It is slow because for each "add", memory has to be re-allocated. In C++/STL, one can reserve memory in advance for the string. Can that be done in PHP as well? It would speed up things a lot with a single extra line.
Magnus
Brion Vibber wrote:
Magnus Manske wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
This doesn't actually solve any of the issues with the current parser, since it merely has it produce a different output format.
Of course, it wouldn't solve any issues with the current parser; it would be another parser altogether, with shiny new issues of its own ;-)
The main problems are that we have a mess of regexps that stomp on each other all the time.
Yes, but we could put what I call #2 into its own class, which could use regexps or not, or into an external program. The wiki2xml C++ parser I am working on doesn't use regexps at all. Or we could use one of these weird compiler-generating languages, or parser-generating ones, if there is such a thing. The point is, it will be separate (read: independent) from the rest of the software, which will simplify things enormously, IMHO.
I think a huge part of the problem is that we generate HTML for some wiki markup, then get into conflict with ourselves on another part. Another part is the interweaving of parsing and database operations. By parsing, checking numerous context conditions, querying the database, and generating HTML, all in a single function, we get lost, along with the parser.
In my proposal, we concentrate on pure markup-to-markup conversion (wiki-to-XML) in one module, and on pure tag *interpretation* in the next. IIRC, PHP already has built-in XML functions. When we enter #3 of my proposal, we can rely on having perfectly valid XML, so weird regexp effects are likely to be a thing of the past there as well.
Back when I started Phase II, I didn't know exactly where I was headed, so I added feature upon feature, up to a point where the code was a mess. Lee Daniel Crocker (are you still out there? There are still some bugs in the sourceforge bugtracker with your name on them;-) had mercy and restructured the code to what became Phase III. Because of that cleanup, that restructuring of things, we were able to get some speed back into the code.
But when I look at Parser.php now, I see a mess similar to the one at the end of Phase II. Yes, it is all OOP (or what PHP thinks OOP is :-) now, but the functions are one nightmare of if-clauses after another. When I tried to join the bughunt, I became increasingly afraid to touch the stuff. Changes in one function had effects on others where there was no obvious reason for that. Granted, I didn't keep as up-to-date with the development as I should have, but I doubt that's the reason for the (more emotional than rational) effect I experienced.
Things have already developed in the direction I proposed. Look at the internal link parsing. It parses the link, building a list of parameters (for images), and *then* analyses the parsed link and its parameters. But that function alone has 160 lines! My C++ function (with C++ being less well suited for string manipulation) to convert wiki markup into XML has 70 lines, and it already parses nested links; no need to call it twice, as we currently do (yuck!).
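The single-pass handling of nested links can be sketched roughly like this in PHP (an illustration of the recursive idea only, not the C++ code; pipes, image parameters and the page attribute are deliberately ignored):

  <?php
  // Scan for "[[", recurse on the link body so nested links are handled
  // in one pass, and emit <wikilink> elements.  Deliberately simplified.
  function parseLinks( $text, &$pos, $insideLink = false ) {
      $out = '';
      $len = strlen( $text );
      while ( $pos < $len ) {
          if ( $insideLink && substr( $text, $pos, 2 ) === ']]' ) {
              $pos += 2;
              return $out;                              // end of current link body
          }
          if ( substr( $text, $pos, 2 ) === '[[' ) {
              $pos += 2;
              $body = parseLinks( $text, $pos, true );  // recurse for nesting
              $out .= '<wikilink>' . $body . '</wikilink>';
          } else {
              $out .= htmlspecialchars( $text[$pos], ENT_QUOTES );
              $pos++;
          }
      }
      return $out;
  }

  $pos = 0;
  echo parseLinks( '[[image:bla.jpg|Text and [[a link]]]]trail', $pos );
  // <wikilink>image:bla.jpg|Text and <wikilink>a link</wikilink></wikilink>trail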
The need to *then* render (X)HTML won't go away magically; but it can be done based on valid XML, which should be *much* cleaner than what we do now.
Magnus
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
Why XML for the intermediate part? As I understand it, speed is important for Wikipedia code, given that it needs to scale to very high usage. Servers are always overloaded.
Using XML for the intermediate form would have a significant negative impact on speed and memory usage (compared to a more efficient implementation strategy, which is always possible).
Given that there is no requirement to serialize to XML (at least not during internal wiki->html conversion), serializing to XML and then parsing this serialized form is pure overhead that can be avoided.
There are many ways to have benefits of XML (structured data) without incurring the overhead of XML.
For example, you can ship objects around. If you know your XML schema and it doesn't change, you can always create an equivalent object hierarchy.
For example:

<wikiPage>
  <title>Title of the page</title>
  <body>This is body</body>
</wikiPage>
can be represented with an object wikiPage() that has members title and body. You can extrapolate this example to any XML with a schema known up-front. Then you can ship those objects around. It's just as clean an approach (the abstraction provided is the same; it's just a different, more efficient way of providing it to clients).
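In PHP, such an object equivalent of the XML snippet might look like this (a hypothetical illustration; the class name and members simply mirror the example above):

  <?php
  // Hypothetical object equivalent of the <wikiPage> example: the same
  // structure, passed around directly instead of being serialised to XML
  // and parsed again on the other side.
  class WikiPage {
      public $title;
      public $body;

      public function __construct( $title, $body ) {
          $this->title = $title;
          $this->body  = $body;
      }
  }

  $page = new WikiPage( 'Title of the page', 'This is body' );
  echo $page->title;   // no XML parsing step on the consuming side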
But as I write it, I see even less reason to use XML - it's just not a good format to represent wiki markup structure. Can you give an example of what this XML would look like for some simple wiki markup? Further discussion about speed/space/code cleanness trade-offs is a bit hard without knowing more details about the proposed approach - it's vague enough to have more than one interpretation.
Krzysztof Kowalczyk | http://blog.kowalczyk.info
Krzysztof Kowalczyk wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
Why XML for the intermediate part? As I understand it, speed is important for Wikipedia code, given that it needs to scale to very high usage. Servers are always overloaded.
My reasons for XML:
* Standardized (for other applications, proven parsers available, easier for new developers to deal with)
* IMHO best way to pass data between a #2 external parser and MediaWiki
* Basing #3 on guaranteed valid XML will ease generation of valid XHTML a lot
For example:
<wikiPage>
  <title>Title of the page</title>
  <body>This is body</body>
</wikiPage>
can be represented with an object wikiPage() that has members title and body. You can extrapolate this example to any XML with a schema known up-front. Then you can ship those objects around. It's just as clean an approach (the abstraction provided is the same; it's just a different, more efficient way of providing it to clients).
Yes, that could be done, but it would require staying in PHP. Also, I (personally) prefer to stick with text-to-text conversion for #2, but that's just my taste.
But as I write it, I see even less reason to use XML - it's just not a good format to represent wiki markup structure. Can you give an example of what this XML would look like for some simple wiki markup? Further discussion about speed/space/code cleanness trade-offs is a bit hard without knowing more details about the proposed approach - it's vague enough to have more than one interpretation.
Well, something like
[[image:bla.jpg|thumb=bla_small.jpg|150px|Text and [[a link]]]]trail
would become
<wikilink page='image:bla.jpg' thumb='bla_small.jpg' width='150px'>Text and <wikilink page='a link'>a link</wikilink>trail</wikilink>
That's the actual output of my parser. I will put the source on the net somewhere once it has all the major functions. Missing at this point are:
* handling of <nowiki> (and <pre>, respectively)
* XML validation (that's going to be the hardest part ;-)
* table markup
* external links and "brute force" markup like ISBNs
Otherwise, it is functional already. It even does the ''italics '''italics bold''''' thingy right ;-) Haven't tested it on "real" Wikipedia pages yet, though.
Magnus
I just glanced at the codebase to take a look exactly what gets cached. If my understanding is correct (and it's possible that it's not, since I only looked for about 5 minutes), the article is reparsed by the parser on every page view[1].
Since disk space is cheap, and disks are relatively fast, why not do the following (it's assumed that a dedicated box is designated the "read" database box):
1. User begins to edit page
2. Software loads mediawiki-markup page from "write" database, displays it
3. User makes changes, submits changes
4. Mediawiki stores the page, verbatim, into the "write" database
5. Mediawiki also runs the page through Parser, takes the raw HTML, and feeds it to a separate, "read" database
Subsequently, any and all article views are routed to the "read" database, reducing the time for all non-edit pageviews to a mere database fetch, which is further performance-optimized by virtue of memcached and the MySQL query cache, and can be cheaply made faster through RAID striping and similar measures.
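A rough sketch of the proposed save path, with entirely hypothetical helper names and table layout, just to pin down the flow:

  <?php
  // Hypothetical sketch of the proposed save path: keep the wikitext in
  // the "write" database, render once per edit, and store the HTML where
  // the read path can fetch it directly.  $writeDb/$readDb are assumed to
  // be DB handles with an insert() helper; $renderWikitext wraps the parser.
  function saveArticle( $writeDb, $readDb, $renderWikitext, $title, $wikitext ) {
      // Store the source text verbatim in the "write" database.
      $writeDb->insert( 'cur_text', array( 'title' => $title, 'text' => $wikitext ) );

      // Render once per edit and store the HTML in the "read" database.
      $html = $renderWikitext( $wikitext );
      $readDb->insert( 'rendered', array( 'title' => $title, 'html' => $html ) );
  }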
Of course, this depends on no user-configurable settings changing the HTML once it's produced. That's in direct conflict with at least one feature that I'm aware of, which is the different style of showing non-existent links, but there are two solutions to this: one, scrap the feature for a massive performance boost, and two, have a special miniparser that only parses the raw HTML from the "read" database for those per-user settings which modify the HTML output.
Either way, the performance penalty of the parser is reduced to a one-time hit when the article is being committed, and the "our parser is slow" discussion becomes moot.
Thoughts? If this is what we need, I might be nudged into providing a patch.
Cheers, Ivan.
[1] For simplicity, I'm leaving the squids out of the equation here.
Ivan Krstic wrote:
- User begins to edit page
- Software loads mediawiki-markup page from "write" database, displays it
- User makes changes, submits changes
- Mediawiki stores the page, verbatim, into the "write" database
- Mediawiki also runs the page through Parser, takes the raw HTML, and
feeds it to a separate, "read" database
Please take a look at the parser cache and see if it's missing any needed functionality.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Please take a look at the parser cache and see if it's missing any needed functionality.
What are the current cache hit/miss rates like? I wrote the last post based on the premise that memcached only makes it possible to cache a fraction of the article space, still making the parser churn quite a bit, making people unhappy. If this is not the case, what are people complaining about?
With another database as opposed to memcached, it's possible to literally reduce things to one parser hit per edit, for all languages. The parser would then need to be *really* slow for anyone to notice a performance penalty, no?
Cheers, Ivan.
Ivan Krstic wrote:
Brion Vibber wrote:
Please take a look at the parser cache and see if it's missing any needed functionality.
What are the current cache hit/miss rates like? I wrote the last post based on the premise that memcached only makes it possible to cache a fraction of the article space, still making the parser churn quite a bit, making people unhappy.
I don't know what our current hit rate looks like, but memcached can be arbitrarily large since it is spread across many machines. (The cluster's total memory capacity is something like 40 GB.)
The parser cache iirc does have to parse _differently_ for users with different options, so I don't know how much duplication there is.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
The parser cache iirc does have to parse _differently_ for users with different options, so I don't know how much duplication there is.
This is probably worth looking into. With duplication and constant growth working against you, 40GB isn't as much as it seems. I'm more interested in your opinion about whether the one parser hit on edit would solve the parser performance issues altogether, since it takes any guesswork out of the picture and scales very easily.
Ivan.
Ivan Krstic wrote:
Brion Vibber wrote:
The parser cache iirc does have to parse _differently_ for users with different options, so I don't know how much duplication there is.
This is probably worth looking into. With duplication and constant growth working against you, 40GB isn't as much as it seems. I'm more interested in your opinion about whether the one parser hit on edit would solve the parser performance issues altogether, since it takes any guesswork out of the picture and scales very easily.
I'm sure it would help some, though I don't know how much as I don't have the hit ratio numbers.
Making it have to parse on edit _only_ would require a number of specific things though such as:
* Parsing has to be independent of user options and settings. Things like math rendering options need to alter only a later stage of output than what is cached.
* Variable substitutions like the current date and number of articles must be kept for later. Note that people like to use the date variables in links for 'X of the day' type features; this causes some niggling trouble with link consistency.
* Template substitution must similarly be delayed. If the templates are pre-parsed though it could be easy to just grab the template's parse tree and stick it in to the appropriate place. Caveat: templates have parameters, with the same kinds of problems as variable substitutions (they are often used in links)
And of course, if we change the parsing rules we need to be able to clear out the cache, either through automatic versioning or some other scheme. I tend to favor versioning and checking the cache for currentness at load time and saving it then if necessary; that's how we deal with most such things.
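The versioning scheme described here could look roughly like this (hypothetical key layout and helper names; a memcached-style cache object with get()/set() is assumed):

  <?php
  // Bump this whenever the parsing rules change; stale cache entries are
  // then ignored and simply re-rendered on the next view.
  define( 'PARSER_OUTPUT_VERSION', 7 );

  function getRenderedPage( $cache, $key, $wikitext, $renderWikitext ) {
      $entry = $cache->get( $key );
      if ( $entry !== false && $entry['version'] === PARSER_OUTPUT_VERSION ) {
          return $entry['html'];                        // cache hit, still current
      }
      $html = $renderWikitext( $wikitext );             // hypothetical renderer
      $cache->set( $key, array(
          'version' => PARSER_OUTPUT_VERSION,
          'html'    => $html,
      ) );
      return $html;
  }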
-- brion vibber (brion @ pobox.com)
On Friday 13 August 2004 22:46, Krzysztof Kowalczyk wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
Why XML for the intermediate part? As I understand it, speed is important for Wikipedia code, given that it needs to scale to very high usage. Servers are always overloaded.
A parser generated with yacc is extremely fast and needs very little memory. The same holds for the XML parser in PHP. Since you don't want to write the parser in PHP, you have an interface problem anyway (you cannot simply exchange a PHP data structure), and this would actually be a very efficient solution.
-- Jan Hidders
Magnus Manske wrote:
Or we could use one of these weird compiler-generating languages, or parser-generating ones, if there is such a thing. The point is, it will be separate (read: independent) from the rest of the software, which will simplify things enormously, IMHO.
Personally, I am very much in favour of using such a parser generator. Some time ago, someone here on this mailing list already proposed this, but I can't find it now. One major advantage of this would be that we can extend the grammar and have the parser be generated from it. We would no longer have to actually tweak the parser. Other advantages include efficiency, and simply the assurance that this is the "correct way", because all other professional applications do it this way.
The process is usually broken down into four phases:
(1) lexing -- turn raw text into series of tokens
(2) parsing -- turn series of tokens into parse tree
(3) processing
(4) compiling -- turn processed parse tree into requested output format
This is extremely general; this whole procedure can apply to programming language compilers (e.g. gcc), markup processors (e.g. browsers, LaTeX) and pretty much anything else that turns a text file in one format into some other format (not necessarily text: in the case of a compiler, it would be executable code). Because of this generality, many tools to perform these tasks already exist. In the case of step 1, this is what "lex" does. Step 2 is the field of expertise of parser generators such as "yacc" or its free-software equivalent "bison". These are C-centric in the sense that they output C code; I'm sure PHP ones exist, but maybe we want to use C for efficiency anyway. Steps 3 and 4 are application-dependent, so they are programmed manually, but given a parse tree, they are easy.
The "process" step is particularly application-dependent; in the case of a programming language compiler, for example, it might perform optimisations. In our case, it means:
(a) find template inclusions, recursively call this entire process with the template's wiki text and replace the template inclusion with the resulting parse tree;
(b) find links and determine if the page they point to is non-existent, a stub, etc., and "annotate" the parse tree accordingly;
(c) probably other little things I haven't thought of.
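As a toy illustration of phase (1) alone -- not a grammar, and nothing like what lex/bison would actually generate -- a few lines of PHP that turn a fragment of wikitext into the kind of token stream a phase-(2) parser would consume:

  <?php
  // Toy lexer: split a fragment of wikitext into (type, text) tokens.
  // A production lexer would be generated (lex/flex) or written in C.
  function lexWikitext( $text ) {
      preg_match_all( "/\[\[|\]\]|'''|''|\||[^\[\]'|]+|./", $text, $m );
      $tokens = array();
      foreach ( $m[0] as $chunk ) {
          switch ( $chunk ) {
              case '[[':  $tokens[] = array( 'LINK_OPEN' );     break;
              case ']]':  $tokens[] = array( 'LINK_CLOSE' );    break;
              case "'''": $tokens[] = array( 'BOLD_TOGGLE' );   break;
              case "''":  $tokens[] = array( 'ITALIC_TOGGLE' ); break;
              case '|':   $tokens[] = array( 'PIPE' );          break;
              default:    $tokens[] = array( 'TEXT', $chunk );  break;
          }
      }
      return $tokens;
  }

  print_r( lexWikitext( "''Hello'' [[world|planet]]" ) );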
I would be more than willing to help with this, especially steps 3 and 4 :-), but since I have absolutely no experience with lex or bison, I would need some help with those.
Have I mentioned yet that this is the only correct way to do this? :-)
Timwi
On Friday 13 August 2004 20:59, Brion Vibber wrote:
Magnus Manske wrote:
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
This doesn't actually solve any of the issues with the current parser, since it merely has it produce a different output format.
The main problems are that we have a mess of regexps that stomp on each other all the time.
Are you kidding? That is exactly what it would solve! If you let the preprocessor be generated with a lex/yacc type of tool, then you would for the first time have decent formal documentation of the wiki-syntax in the form of a context-free grammar. That would not only give you a better idea of what the wiki-syntax exactly is and tell you exactly whether any new mark-up interferes with old mark-up, but you could also more easily add context-sensitive rules (like replacing two dashes with "&mdash;", but only in normal text). Moreover, it would give you the power to make small changes to the mark-up language, because you could easily generate a parser that translates all old texts to the new mark-up. Finally, having an explicit grammar also makes it easier to make sure that you actually generate well-formed and valid XHTML, or anything else that you would like to generate from it that somehow needs to satisfy a certain syntax.
It's simply a brilliant idea, and frankly I think it is in the long run as unavoidable as the step to a database backend. If there is a performance problem, you could even consider storing the XML in the database, so you only need to do the raw parse at write time and the XML parse at read time.
The hard part is of course coming up with the context-free grammar (it should probably be LALR(1) at that). Since I used to teach compiler theory, I might be of some help there.
-- Jan Hidders
PS. You could even get rid of the OCaml code, since the LaTeX parsing could be integrated into the general parser.
Jan Hidders wrote: [snip]
The hard part is of course coming up with the context-free grammar (it should probably be LALR(1) at that). Since I used to teach compiler theory, I might be of some help there.
Yes, and that's the *only* part that will help. Having two or three or five intermediate formats doesn't do anything to help the problem -- making the actual parser actually work token by token will.
IMHO putting a lot of emphasis on output formats is a mistake, since it ignores the actual problem.
-- brion vibber (brion @ pobox.com)
On Saturday 14 August 2004 00:49, Brion Vibber wrote:
Jan Hidders wrote: [snip]
The hard part is of course coming up with the context-free grammar (it should probably be LALR(1) at that). Since I used to teach compiler theory, I might be of some help there.
Yes, and that's the *only* part that will help. Having two or three or five intermediate formats doesn't do anything to help the problem -- making the actual parser actually work token by token will.
IMHO putting a lot of emphasis on output formats is a mistake, since it ignores the actual problem.
FWIW I completely agree with that. Having a real parser in C/C++ generated by a parser-generator is IMHO the core of the idea, and if you need to interface that somehow with the rest of the PHP code then XML is a good solution. I had the impression that this is what Magnus meant anyway, but I may have been wishful reading there.
-- Jan Hidders
Jan Hidders wrote:
On Saturday 14 August 2004 00:49, Brion Vibber wrote:
Jan Hidders wrote: [snip]
The hard part is of course coming up with the context-free grammar (it should probably be LALR(1) at that). Since I used to teach compiler theory, I might be of some help there.
Yes, and that's the *only* part that will help. Having two or three or five intermediate formats doesn't do anything to help the problem -- making the actual parser actually work token by token will.
IMHO putting a lot of emphasis on output formats is a mistake, since it ignores the actual problem.
FWIW I completely agree with that. Having a real parser in C/C++ generated by a parser-generator is IMHO the core of the idea, and if you need to interface that somehow with the rest of the PHP code then XML is a good solution. I had the impression that this is what Magnus meant anyway, but I may have been wishful reading there.
No, you got that right :-)
If we'd limit ourselves to PHP, we could avoid XML, but PHP is not very well suited for this kind of text parsing, even with regexps, especially when it comes to performance. Also, using XML as an intermediate at some point opens up the potential for exchange with other services, either for output (generating static dumps, statistical analyses, etc.) or input (e.g., using other wiki markup in MediaWiki, for other projects).
The one I'm currently working on is "manual" C++, but I see no reason not to use a parser generator. Question: can a parser generator ensure the output (no matter the input) is valid XML? Can it remove potentially harmful HTML tags? If not, is there a tool (tidyhtml comes to mind) that can? The chain would then be preprocessing-parsing-XML-XHTML.
The XML could be cached, as all changes influenced by user options would happen only in the final step. Caveat: Cache will have to be invalidated for variables and templates that change (e.g., {{NUMBEROFARTICLES}} and edited templates).
Once the #2 parser is basically working, we (I) can adapt a copy of the parser class and run some benchmarks against the current parser. That should give us some more facts to base a decision on.
Magnus
The one I'm currently working on is "manual" C++, but I see no reason not to use a parser generator. Question: can a parser generator ensure the output (no matter the input) is valid XML? Can it remove potentially harmful HTML tags?
Looks like you misunderstood what a parser generator does.
Based on a formal grammar, a parser generator (like yacc or bison) generates a parser for that grammar. It does nothing else. It's up to you to write the code/actions for the grammar parts (in this case, outputting XML text).
So the answer is: no.
Krzysztof Kowalczyk | http://blog.kowalczyk.info
Magnus Manske wrote:
If we'd limit ourselves to PHP, we could avoid XML, but PHP is not very well suited for this kind of text parsing, even with regexps, especially when it comes to performance. Also, using XML as an intermediate at some point opens up the potential for exchange with other services, either for output (generating static dumps, statistical analyses, etc.) or input (e.g., using other wiki markup in MediaWiki, for other projects).
I can sympathise with this line of reasoning. The lexing and parsing needs to be done in C/C++ for efficiency, but then the generated parse tree should be processed and compiled by PHP, so we need to "transfer" it, and XML does that job.
The one I'm currently working on is "manual" C++, but I see no reason not to use a parser generator. Question: can a parser generator ensure the output (no matter the input) is valid XML?
Krzysztof Kowalczyk already tried to explain this, but I'll put it in different words. The parser generator doesn't "ensure" the output is valid XML; the programmer has to do that. But it's easy. A parser generator generates the code to turn wikitext into a parse tree; to go from a parse tree to a valid-XML representation of that parse tree is like 1+1=2. It's trivial.
Do you know what a parse tree is? If not, let me know, and I'll try to explain that to you too.
Can it remove potentially harmful HTML tags?
Don't think of it as "removing". Don't think of the whole process as turning one text directly into another. It's not like that.
You give a parser generator a "grammar" (see [[formal grammar]]). If this grammar says that "<marquee>" isn't a syntax element, then it will be interpreted as text, and stored as such in the parse tree. Later (much later), when the parse tree is compiled into actual (X)HTML, the text would (obviously) be HTML-escaped. So it would become &lt;marquee&gt;.
Alternatively, of course, one can explicitly define <marquee> to be a null syntax element in the grammar, so that the final output doesn't contain it. But that shouldn't be necessary.
The chain would then be preprocessing-parsing-XML-XHTML.
Uhm... no. See: http://mail.wikipedia.org/pipermail/wikitech-l/2004-August/012135.html where I describe the process.
You don't want to do anything to the text before parsing it (assuming here that parsing includes lexing, although technically they're separate steps). You want to do all processing after parsing.
Why? Well, because this is the purpose of parsing. We want to turn the text into a data structure that computers can handle better than text. It is much easier and much less error-prone to say "if this object is a tree node representing template-inclusion, then do this" than to say "search the string for some fuzzy pattern that looks a bit like a template inclusion, but look out for nesting, and make sure you get the parameters right, because they might contain pipes inside piped links, and try not to mess things up."
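To make the contrast concrete: on a parse tree, template expansion is just a recursive walk over nodes (the array-based node shape below is hypothetical, not MediaWiki code):

  <?php
  // Hypothetical node shape: array( 'type' => ..., 'name' => ...,
  // 'children' => array( ... ) ).  Template expansion is then a walk that
  // replaces each 'template' node with that template's own parse tree.
  function expandTemplates( array $node, callable $fetchTemplateTree ) {
      if ( $node['type'] === 'template' ) {
          return $fetchTemplateTree( $node['name'] );
      }
      if ( isset( $node['children'] ) ) {
          foreach ( $node['children'] as $i => $child ) {
              $node['children'][$i] = expandTemplates( $child, $fetchTemplateTree );
          }
      }
      return $node;
  }

The same kind of walk can annotate link nodes with "exists/stub/red link" information, which is exactly the processing step described above.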
The XML could be cached, as all changes influenced by user options would happen only in the final step.
That is correct! But this XML would still be just the parse tree for the wiki text.
Caveat: Cache will have to be invalidated for variables and templates that change (e.g., {{NUMBEROFARTICLES}} and edited templates).
This is obvious. We already have to invalidate the cache for everything that changes.
However, currently we also need to invalidate the parser cache for pages that include a template we have edited. This should not be necessary. We should be able to retrieve the parse tree for a page independently of that for the included templates. To put them together at page-view time is not costly, because we have to compile the parse tree into HTML anyway (which is also not costly, but it means one sweep through the parse tree).
Once the parser is basically working, we can [...] run some benchmarks against the current parser. That should give us some more facts to base a decision on.
Personally, I'm almost inclined to say that having a proper parser is more important than performance. :-) But I'm confident that it will outperform the current parser by far.
Timwi
Timwi wrote:
Magnus Manske wrote:
The one I'm currently working on is "manual" C++, but I see no reason not to use a parser generator. Question: can a parser generator ensure the output (no matter the input) is valid XML?
Krzysztof Kowalczyk already tried to explain this, but I'll put it in different words. The parser generator doesn't "ensure" the output is valid XML; the programmer has to do that. But it's easy. A parser generator generates the code to turn wikitext into a parse tree; to go from a parse tree to a valid-XML representation of that parse tree is like 1+1=2. It's trivial.
Do you know what a parse tree is? If not, let me know, and I'll try to explain that to you too.
I do know what it is. I already wrote a compiler or two during my informatics classes. The way I put the questions in the original mail was confusing, and I apologize. It was meant rhetorically, as in "I don't think it does".
The chain would then be preprocessing-parsing-XML-XHTML.
Uhm... no. See: http://mail.wikipedia.org/pipermail/wikitech-l/2004-August/012135.html where I describe the process.
You don't want to do anything to the text before parsing it (assuming here that parsing includes lexing, although technically they're separate steps). You want to do all processing after parsing.
By "preprocessing", I mean replacing the {{template}} things with the appropriate text stored in the database. The external parser can't do that, and it can't be done afterwards, when everything is XML already (see below). So, we'll have to do it as the very first step, just like C/C++ does.
Why? Well, because this is the purpose of parsing. We want to turn the text into a data structure that computers can handle better than text. It is much easier and much less error-prone to say "if this object is a tree node representing template-inclusion, then do this" than to say "search the string for some fuzzy pattern that looks a bit like a template inclusion, but look out for nesting, and make sure you get the parameters right, because they might contain pipes inside piped links, and try not to mess things up."
Example:

{| {{template}}
| stuff
|}

with {{template}} being "bgcolor=#FFFFFF". Wouldn't that be filtered out if we do the template replacement when we're already in XML? Because the XML would probably look something like

<table <wikitemplate>template</wikitemplate>>
  <tr><td>stuff</td></tr>
</table>

if we let the parser loose on the template inclusion, which is *not* valid XML. C/C++ also does "#define" before the actual parser. It is exactly the same as our templates.
Of course, we can explicitly forbid cases like the above.
The XML could be cached, as all changes influenced by user options would happen only in the final step.
That is correct! But this XML would still be just the parse tree for the wiki text.
Yes, the XML->(X)HTML step (compiling) would have to be done again each time. But I figure that will take way less time than parsing does now.
Personally, I'm almost inclined to say that having a proper parser is more important than performance. :-) But I'm confident that it will outperform the current parser by far.
All in due time (pun intended:-)
Magnus
Magnus Manske wrote:
Timwi wrote:
Do you know what a parse tree is? If not, let me know, and I'll try to explain that to you too.
I do know what it is. I already wrote a compiler or two during my informatics classes. The way I put the questions in the original mail was confusing, and I apologize. It was meant rhetorically, as in "I don't think it does".
Does that mean the compilers you wrote in your informatics classes didn't produce valid executable code? ;-)
I'm sorry. I'm being a bit silly here. Please ignore that. :)
You don't want to do anything to the text before parsing it (assuming here that parsing includes lexing, although technically they're separate steps). You want to do all processing after parsing.
By "preprocessing", I mean replacing the {{template}} things with the appropriate text stored in the database.
Yes, I know what you meant. But I still don't think it's necessary to do that before the parsing.
Personally I think your example ("{| {{template}}") should be disallowed because it's clearly evil. The correct way is, of course, to have one whole table in one template, and not parts of it scattered around in various places. But it's still possible:
Because the XML would probably look something like
<table <wikitemplate>template</wikitemplate>>
  <tr><td>stuff</td></tr>
</table>
No, of course it wouldn't. If it does, then you've done something horribly wrong with your generator code. It should instead look like this:
<table>
  <attrs><wikitemplate>template</wikitemplate></attrs>
  <tr><td>stuff</td></tr>
</table>
Then the processing step expands all templates:
<table>
  <attrs>bgcolor="white" border="1"</attrs>
  <tr><td>stuff</td></tr>
</table>
Then the processing step expands all attrs:
<table>
  <bgcolor>white</bgcolor>
  <border>1</border>
  <tr><td>stuff</td></tr>
</table>
But as I said, I think it should be disallowed. I think *at most* something like {| bgcolor="{{template}}" should perhaps be allowed, and that would yield:
<table>
  <bgcolor><wikitemplate>template</wikitemplate></bgcolor>
  <tr><td>stuff</td></tr>
</table>
Then, of course, the processing step will find the "include" block and replace it with, say, "white".
(Of course, it should be noted here that the "table", "tr" and "td" thingies aren't HTML tags, and are conceptually different from the corresponding HTML tags. I'm sure you know this, but I'm pointing this out for our other readers.)
C/C++ also does "#define" before the actual parser. It is exactly the same as our templates.
That doesn't mean it's the right thing to do. C/C++ macro pre-processing is widely considered evil. No other programming language uses it to any such significant extent. I am rather opposed to re-using that horrible idea in MediaWiki's syntax.
Of course, we can explicitly forbid cases like the above.
Which I would very much prefer we did, but we don't have to.
Greetings, Timwi
Magnus Manske wrote:
OK, we all know that the current parser, while working, is not the final word. It is kinda slow due to multiple passes, the source is confusing, and there are some persistent bugs in it, like the template malfunctions.
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
Let me go through these. The preprocessor would basically do the template and variable stuff; dumb text replacement, like a C/C++ preprocessor.
Well, isn't this going to create exactly the same problems as the ones we have now? We don't want "dumb text replacement" because "dumb text replacement" is "dumb".
I am thinking of XML that can be generated from the wiki text /without any knowledge of the rest of the database/. The converter will not check "does this article exist?" or "is this an internal link, an image, an interwiki link, or what?". It should only convert "[[this|or that]]" to "<wikilink page='this'>or that</wikilink>". Also, it should produce only valid XML, no matter the input.
By choosing XML in preference over a native PHP structure, you are essentially introducing unnecessary problems, such as having to ensure XML validity.
IMHO that would separate the actual *parsing* of the wiki markup from its *meaning*. There are useful methods, functions, libraries and-what-not for dealing with XML.
Yes, there are useful functions; however, they don't really do anything that is particularly useful for parsing text specifically, or anything as specific as this. It is better to have PHP classes which can have methods that we can define, extend, and modify.
Of course, this is just my opinion, etc.
Timwi
On Fri, Aug 13, 2004 at 08:46:20PM +0200, Magnus Manske wrote:
- The XML-HTML-converter (#3) can focus on the pure context of the page:
make a normal link, stub link, thumbnailed image, or a category or interwiki link, etc.
It'd probably be sufficient to do an XSL style sheet. You could even send XML+XSL to the client for clients that support it, and do server-side XSL for clients that don't.
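Server-side, that could look roughly like the following PHP (assuming the optional XSL extension is available; the <wikilink> rule and the /wiki/ URL scheme are made up for the illustration):

  <?php
  // Rough sketch: transform the intermediate XML with an XSL stylesheet.
  $xsl = new DOMDocument();
  $xsl->loadXML( '
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="wikilink">
      <a href="/wiki/{@page}"><xsl:apply-templates/></a>
    </xsl:template>
    <xsl:template match="@*|node()">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
  </xsl:stylesheet>' );

  $xml = new DOMDocument();
  $xml->loadXML( "<p>See <wikilink page='Main Page'>the main page</wikilink>.</p>" );

  $proc = new XSLTProcessor();
  $proc->importStylesheet( $xsl );
  echo $proc->transformToXml( $xml );
  // prints (after the XML declaration):
  // <p>See <a href="/wiki/Main Page">the main page</a>.</p>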
~ESP
evan@wikitravel.org wrote:
On Fri, Aug 13, 2004 at 08:46:20PM +0200, Magnus Manske wrote:
- The XML-HTML-converter (#3) can focus on the pure context of the page:
make a normal link, stub link, thumbnailed image, or a category or interwiki link, etc.
It'd probably be sufficient to do an XSL style sheet. You could even send XML+XSL to the client for clients that support it, and do server-side XSL for clients that don't.
That unfortunately introduces another dependency. PHP has an optional, non-default XSLT module which requires an external library, or you can run through an external program as a filter (requires that one be obtained and be runnable, etc).
-- brion vibber (brion @ pobox.com)
evan@wikitravel.org wrote:
On Fri, Aug 13, 2004 at 08:46:20PM +0200, Magnus Manske wrote:
- The XML-HTML-converter (#3) can focus on the pure context of the page:
make a normal link, stub link, thumbnailed image, or a category or interwiki link, etc.
It'd probably be sufficient to do an XSL style sheet. You could even send XML+XSL to the client for clients that support it, and do server-side XSL for clients that don't.
I don't know much about XSL(T), but personally I highly doubt that it can give us the flexibility we need.
On Sat, Aug 14, 2004 at 09:01:04PM +0100, Timwi wrote:
It'd probably be sufficient to do an XSL style sheet. You could even send XML+XSL to the client for clients that support it, and do server-side XSL for clients that don't.
I don't know much about XSL(T), but personally I highly doubt that it can give us the flexibility we need.
Do you make a habit of forming strong opinions based on near-total ignorance?
~ESP
evan@wikitravel.org wrote:
On Sat, Aug 14, 2004 at 09:01:04PM +0100, Timwi wrote:
It'd probably be sufficient to do an XSL style sheet. You could even send XML+XSL to the client for clients that support it, and do server-side XSL for clients that don't.
I don't know much about XSL(T), but personally I highly doubt that it can give us the flexibility we need.
Do you make a habit of forming strong opinions based on near-total ignorance?
Do you make a habit of forming arguments based on ad-hominem attacks?
On Fri, 2004-08-13 at 20:46 +0200, Magnus Manske wrote:
Warning: Yet Another Crazy Idea of Mine ahead. If you're sick of these (by bitter experience ;-) delete this mail *now*.
Still here? Great!
OK, we all know that the current parser, while working, is not the final word. It is kinda slow due to multiple passes, the source is confusing, and there are some persistent bugs in it, like the template malfunctions.
I therefore suggest a new structure:
- Preprocessor
- Wiki markup to XML
- XML to (X)HTML
This is what I'm writing currently, except that the parser will return a DOM tree instead of the XML dump of it. That saves another parse step before postprocessing (template replacement, link status updates, etc., and the final XSLT transform). Besides being able to save the DOM tree as XML at any stage, it's also possible to pickle the Python object, which is a bit faster to wake up than XML.
Caveat: this is based on Python's XML features; I don't know a lot about PHP DOM implementations.
Magnus Manske wrote:
Timwi wrote:
Gabriel Wicke wrote:
This is what I'm writing currently
So how many people are currently working on a new parser? :-)
Never heard of redundancy? We'll get at least *one* parser done, even if some of us go bluescreen in the process ;-)
I think I like Magnus' one better. Gabriel's might be alright if it was in CVS, but it's not as far as I can see. Will I be stepping on any toes if I do some work on it?
-- Tim Starling
Tim Starling wrote:
I think I like Magnus' one better. Gabriel's might be alright if it was in CVS, but it's not as far as I can see. Will I be stepping on any toes if I do some work on it?
Caveat: Mine's more a converter than a parser, though it *should* return similar output (not the same, since we don't use the same XML). In the long run, a real parser will likely become the solution of choice.
Magnus
On Sun, 2004-08-22 at 21:13 +1000, Tim Starling wrote:
Magnus Manske wrote:
Timwi wrote:
Gabriel Wicke wrote:
This is what I'm writing currently
So how many people are currently working on a new parser? :-)
Never heard of redundancy? We'll get at least *one* parser done, even if some of us go bluescreen in the process ;-)
I think I like Magnus' one better. Gabriel's might be alright if it was in CVS, but it's not as far as I can see. Will I be stepping on any toes if I do some work on it?
No, if there's interest I (or you) can upload it to CVS as well -- it's just not really geared towards working with the current PHP code (it would only return plain XML, which PHP would need to re-parse). It also supports big parts of the current Moin syntax in addition to the MediaWiki one, and integrating the parser with the Moin framework is what I'd like to do after getting the DOM processing working. Kind of another wiki supporting the MediaWiki syntax (yes, with some MW features missing), but with smaller code and a cleaner architecture. The planned and currently-being-worked-on Moin framework promises many small components tied together by events and interfaces, which makes it easy to extend/change things without hacking the main code. Current Moin is also faster than MediaWiki, mainly due to mod_python's long-running processes (little Setup.php-like overhead) and an efficient pickle-based caching system.