Does anyone have a requirements document for the Wikipedia parser?
If not, will those programmers who have already begun work on such a parser, like Magnus and Frithjof, please send me any scraps of documentation you have?
I would like to assemble these into a wiki grammar or something like that, so we can help each other with parser development.
I guess the list of "stupid parser tricks" would start with bracket notation for links:
[http://www.edpoor.com] is a link to my outdated, static website
[http://www.edpoor.com/images/Ae-inAndDog.jpg girl with dog] is an annotated link
[[Iraq]] links to the Wikipedia article on Iraq
[[Iraq|Rummyland]] links to Iraq but is shown as "Rummyland" (a Doonesbury reference, okay? ;-)
Etc.
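To make these concrete, here is a rough PHP sketch of how the bracket notations could be turned into HTML with regular expressions. It is only an illustration (the renderLinks() name and the /wiki/ URL prefix are made up), not how the real code does it:

function renderLinks( $text ) {
    // [http://example.com some label] -> annotated external link
    $text = preg_replace( '/\[(https?:\/\/[^\s\]]+)\s+([^\]]+)\]/',
                          '<a href="$1">$2</a>', $text );
    // [http://example.com] -> bare external link, URL shown as the text
    $text = preg_replace( '/\[(https?:\/\/[^\s\]]+)\]/',
                          '<a href="$1">$1</a>', $text );
    // [[Iraq|Rummyland]] -> internal link shown with alternate text
    $text = preg_replace( '/\[\[([^\]|]+)\|([^\]]+)\]\]/',
                          '<a href="/wiki/$1">$2</a>', $text );
    // [[Iraq]] -> plain internal link
    $text = preg_replace( '/\[\[([^\]|]+)\]\]/',
                          '<a href="/wiki/$1">$1</a>', $text );
    return $text;
}

(A real grammar would also have to deal with URL encoding, namespaces, and links mixed into other markup, which is exactly why I'd like a written-down spec.)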
Along with parsing rules for the rendering of text, there is the problem of fetching and posting files. That is, coordinating each user's off-line stash (cache?) with the database. Note that some users might not want the entire encyclopedia, but perhaps only those articles they're working on. Or articles one click away?
Ed Poor
On Mon, 13 Oct 2003, Poor, Edmund W wrote:
I guess the list of "stupid parser tricks" would start with bracket notation for links:
[http://www.edpoor.com] is a link to my outdated, static website
[http://www.edpoor.com/images/Ae-inAndDog.jpg girl with dog] is an annotated link
[[Iraq]] links to the Wikipedia article on Iraq
[[Iraq|Rummyland]] links to Iraq but is shown as "Rummyland" (a Doonesbury reference, okay? ;-)
Isn't the "Editing help" page already a parser guide? At least, that's what I used when writing mine :-)
Ciao, Alfio
Alfio Puglisi wrote:
On Mon, 13 Oct 2003, Poor, Edmund W wrote:
I guess the list of "stupid parser tricks" would start with bracket notation for links:
[http://www.edpoor.com] is a link to my outdated, static website
[http://www.edpoor.com/images/Ae-inAndDog.jpg girl with dog] is an annotated link
[[Iraq]] links to the Wikipedia article on Iraq
[[Iraq|Rummyland]] links to Iraq but is shown as "Rummyland" (a Doonesbury reference, okay? ;-)
Isn't the "Editing help" page already a parser guide? At least, that's what I used when writing mine :-)
Same here :-)
BTW, a few thoughts:
* We should get rid of the ";xxx :yyy" thing ;-)
* What to do with wiki-HTML-mixes (e.g., "This '''is bold</b>")?
* Do we have to hack HTML validation like we did in PHP, or is there some library we could use?
Magnus
Magnus Manske wrote:
Alfio Puglisi wrote:
On Mon, 13 Oct 2003, Poor, Edmund W wrote:
I guess the list of "stupid parser tricks" would start with bracket notation for links:
[http://www.edpoor.com] is a link to my outdated, static website
[http://www.edpoor.com/images/Ae-inAndDog.jpg girl with dog] is an annotated link
[[Iraq]] links to the Wikipedia article on Iraq
[[Iraq|Rummyland]] links to Iraq but is shown as "Rummyland" (a Doonesbury reference, okay? ;-)
Isn't the "Editing help" page already a parser guide? At least, that's what I used when writing mine :-)
Same here :-)
Me, too :)
- What to do with wiki-HTML-mixes (e.g., "This '''is bold</b>")?
Hmm, hard to say. The need for real specs gets clearer and clearer to me the more I think about it.
- Do we have to hack HTML validation like we did in PHP, or is there
some library we could use?
I think this depends on the language we use.
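In PHP, for instance, a very crude whitelist approach is possible without any extra library. This is only a sketch: strip_tags() leaves attributes on the allowed tags untouched and doesn't check nesting, so real validation would need more than this.

// Keep only the tags we allow in articles, drop everything else.
// Attribute checking and proper nesting would still have to be done separately.
$allowed = '<b><i><u><s><p><br><pre><table><tr><td><th><ul><ol><li><dl><dt><dd>';
$text = strip_tags( $text, $allowed );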
-- Frithjof
Poor, Edmund W wrote:
Does anyone have a requirements document for the Wikipedia parser?
If not, will those programmers who have already begun work on such a parser, like Magnus and Frithjof, please send me any scraps of documentation you have?
Are you writing this document on Meta? I'll help :)
I guess the list of "stupid parser tricks" would start with bracket notation for links:
[http://www.edpoor.com] is a link to my outdated, static website
[http://www.edpoor.com/images/Ae-inAndDog.jpg girl with dog] is an annotated link
[[Iraq]] links to the Wikipedia article on Iraq
[[Iraq|Rummyland]] links to Iraq but is shown as "Rummyland" (a Doonesbury reference, okay? ;-)
Actually, it's been suggested to use double brackets for both types of link.
There seems to be a lot of disjoint discussion on Meta about this. Viz:
* There is work that has been done by Taw on an OCAML lexer at http://meta.wikipedia.org/wiki/Wikipedia_lexer
* There are some links at http://meta.wikipedia.org/wiki/Wikitext_syntax
* A proposal for a radically different Wiki text language at http://meta.wikipedia.org/wiki/Wikitax
* A brief take at http://meta.wikipedia.org/wiki/Wiki_markup_syntax
* A nearly content-free page at http://meta.wikipedia.org/wiki/Wiki_syntax
* A draft XML syntax of Wikitext at http://meta.wikipedia.org/wiki/Wikipedia_DTD
Clearly there needs to be some kind of centralized place for work on formalizing the language. I would suggest the recently-created http://meta.wikipedia.org/wiki/Wikitext_standard
Right now what we should work on is, as Ed says, to describe and formalize a 1.0 version of the Wikitext language, based on what is used currently. In other words, this work should not (for right now) involve incorporating improvements or changes to the Wikitext language.
Moving on...
First, a couple of issues of nomenclature that we should probably get out of the way:
(1) We need to decide on a name for the wiki markup language or Wiki text. I would advocate calling the language "Wikitext" (and calling it "the Wikitext language" when usage might be ambiguous, like "C" or "the C language"). This seems to be common usage.
(2) A program that converts Wikitext to HTML really consists of three (at this point, entirely theoretical) parts: the lexical analyzer, the parser, and the (HTML) code generator. Of course, our language is so simple and the output language so similar to the input that these steps are basically all rolled into one. Nevertheless, calling the whole system a 'parser' is not strictly correct. I think 'translator' is more accurate, at least from a CS perspective. I will use the name "Wikitext to HTML translator" unless someone comes up with something better.
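To make the distinction concrete, a toy Wikitext to HTML translator with the three stages kept separate might look like the sketch below (in PHP, with made-up function names, handling only '''bold''' text and plain [[internal links]]):

// Toy illustration only; the function names are made up.

// 1. Lexical analysis: split the raw Wikitext into tokens.
function wikitextTokenize( $text ) {
    return preg_split( "/('''|\[\[|\]\])/", $text, -1,
                       PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
}

// 2. Parsing: turn the token stream into a tiny parse tree.
function wikitextParse( $tokens ) {
    $tree = array();
    $n = count( $tokens );
    for ( $i = 0; $i < $n; $i++ ) {
        if ( $tokens[$i] == "'''" && $i + 2 < $n ) {
            $tree[] = array( 'bold', $tokens[$i + 1] );
            $i += 2;                  // skip the content and the closing '''
        } elseif ( $tokens[$i] == '[[' && $i + 2 < $n ) {
            $tree[] = array( 'link', $tokens[$i + 1] );
            $i += 2;                  // skip the title and the closing ]]
        } else {
            $tree[] = array( 'text', $tokens[$i] );
        }
    }
    return $tree;
}

// 3. Code generation: walk the tree and emit HTML.
function wikitextEmitHtml( $tree ) {
    $html = '';
    foreach ( $tree as $node ) {
        list( $type, $value ) = $node;
        if ( $type == 'bold' ) {
            $html .= '<b>' . $value . '</b>';
        } elseif ( $type == 'link' ) {
            $html .= '<a href="/wiki/' . $value . '">' . $value . '</a>';
        } else {
            $html .= $value;
        }
    }
    return $html;
}

// The whole "translator" is just the three stages composed:
function wikitextToHtml( $text ) {
    return wikitextEmitHtml( wikitextParse( wikitextTokenize( $text ) ) );
}

In the current code these stages are all collapsed into one pass of regular-expression replacements, which is exactly why it is hard to point at "the parser".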
In addition to a formalization of the language, we also need a *reference* implementation of a Wikitext to HTML translator. Right now what we have is a de facto reference translator: the functions in OutputPage.php. I think most would agree that they're not an ideal implementation, but right now, it's the only (proven) complete and working implementation of a translator.
The current translator has the following practical and theoretical flaws:
(1) It is a little buggy, and, as Neil R. pointed out, there are some rendering quirks, documented at http://en.wikipedia.org/wiki/User:Marumari/Wikitext_Rendering_Quirks.
(2) It is written in PHP, which is a relatively slow scripting language.
(3) It works mainly by using regular expression search-and-replace, which can be wildly inefficient.
(4) And from a theoretical standpoint, it isn't based on any formally declared reference grammar for Wikitext, leading to (1).
The ideal translator will:
(1) Be written so that it is very efficient, either in PHP or a compiled language like C or C++.
(2) Be portable and embeddable in a variety of language environments.
(3) Be an example of well-written code generally.
Other thoughts I couldn't find a good place for above:
* A translator written using Lex and Yacc would be a C translator, as that is the output language of those tools. I think using Lex and Yacc or similar tools would be a good approach because it would make alterations to the language relatively easy to implement.
* The SWIG interface compiler (http://www.swig.org) can be used to compile C or C++ directly into a PHP module that can be called with normal PHP function calls. If a C or C++ translator is used and the efficiency of the translator becomes a major performance concern, then using SWIG to compile the translator directly into PHP would probably be the most efficient way to use it. SWIG can also compile C and C++ into modules for Perl, Python, Tcl, Ruby, Java, and some other languages.
* Obviously, for usability purposes, we have decided not to use an XML-compatible language. That is fine. However, given the ubiquity of XML and the tools to manipulate it, I think it is desirable to have a canonical translation between Wikitext and XML. Having an XML translation of Wikitext would allow better interoperation between Wikitext documents and other systems. Also, the conversion from XML->HTML could be handled by standardized software and technologies, like XSLT. I recognize that current implementations of these standards are lacking in some areas, but in the long term they may be the best solution. For now, I think, there is no reason not to just focus on making a good Wikitext to HTML translator.
* We can have a competition of sorts to pick the best implementation of a Wikitext->HTML translator and declare that the 1.0 reference translator.
* As Neil H. said, there should be a way for translators "to be validated as correct, by allowing the compilation of a set of unit tests"; a sketch of what that might look like is below.
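A (hypothetical) shape for such a test suite could be as simple as a list of Wikitext/expected-HTML pairs that every candidate translator has to reproduce. In the PHP sketch below, wikitextToHtml() stands for whatever translator is under test, and the expected strings are only placeholders, since pinning down the exact output is precisely the job of the standard:

// Hypothetical test harness; the expected HTML is only a placeholder.
$cases = array(
    array( "'''bold'''",         '<b>bold</b>' ),
    array( '[[Iraq]]',           '<a href="/wiki/Iraq">Iraq</a>' ),
    array( '[[Iraq|Rummyland]]', '<a href="/wiki/Iraq">Rummyland</a>' ),
);

$failures = 0;
foreach ( $cases as $case ) {
    list( $wikitext, $expected ) = $case;
    $actual = wikitextToHtml( $wikitext );   // the translator under test
    if ( $actual != $expected ) {
        print "FAIL on '$wikitext': expected '$expected', got '$actual'\n";
        $failures++;
    }
}
if ( $failures ) {
    print "$failures case(s) failed\n";
} else {
    print "all cases passed\n";
}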
I will put most of this content on Meta, but I thought I should post it to the mailing list to stir up interest in a way that can be put to good use.
- David [[User: Nohat]]
On Mon, 13 Oct 2003 17:21:15 -0400, David Friedland david@nohat.net gave utterance to the following:
There seems to be a lot of disjoint discussion on Meta about this. Viz:
- There is work that has been done by Taw on an OCAML lexer at http://meta.wikipedia.org/wiki/Wikipedia_lexer
- There are some links at http://meta.wikipedia.org/wiki/Wikitext_syntax
- A proposal for a radically different Wiki text language at http://meta.wikipedia.org/wiki/Wikitax
- A brief take at http://meta.wikipedia.org/wiki/Wiki_markup_syntax
- A nearly content-free page at http://meta.wikipedia.org/wiki/Wiki_syntax
- A draft XML syntax of Wikitext at http://meta.wikipedia.org/wiki/Wikipedia_DTD
Clearly there needs to be some kind of centralized place for work on formalizing the language. I would suggest the recently-created http://meta.wikipedia.org/wiki/Wikitext_standard
Right now what we should work on is, as Ed says, to describe and formalize a 1.0 version of the Wikitext language, based on what is used currently. In other words, this work should not (for right now) involve incorporating improvements or changes to the Wikitext language.
Moving on...
First, a couple of issues of nomenclature that we should probably get out of the way:
(1) We need to decide on a name for the wiki markup language or Wiki text. I would advocate calling the language "Wikitext" (and calling it "the Wikitext language" when usage might be ambiguous, like "C" or "the C language"). This seems to be common usage.
My suggestions would be "the broken wikitext language" or "the invalid wikitext language". Because of its UseMod ancestry, the current parser produces some very bad HTML code*, and in particular handles lists and nesting of blocks really badly.

* Not so bad if HTML 3.2 or 4 is our target, but it would be nice to be able to produce clean XHTML.

A few months back I started work on a ValidWiki parser, which has a much stronger concept of block and line elements, and uses both block and line stacks to open and close all elements correctly. I think I'm about 2/3 of the way through the block parser, and hadn't yet written the line parser. I have no idea how the code would compare for efficiency. Unfortunately, the only language I know how to code in is MivaScript, so it would need porting. (Miva performs okay for your mid-level merchant application, but doesn't have the efficiency for something with the workload of Wikipedia.)
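Very roughly, the block-stack idea looks like this (a PHP-ish sketch rather than my actual MivaScript code, cut down to bulleted lists and paragraphs only, but it shows why everything ends up properly closed):

// Sketch only: keep a stack of the block elements that are currently
// open and close them in reverse order whenever the nesting changes.
function renderBlocks( $lines ) {
    $stack = array();    // open elements, innermost last, e.g. ul, li, ul, li
    $html  = '';

    foreach ( $lines as $line ) {
        $depth = strspn( $line, '*' );           // list nesting depth, 0 = plain text
        $text  = trim( substr( $line, $depth ) );

        // Close elements until we are no deeper than this line needs.
        // Each open list level is a 'ul' plus its current 'li' on the stack.
        while ( count( $stack ) > 2 * $depth ) {
            $html .= '</' . array_pop( $stack ) . '>';
        }

        if ( $depth == 0 ) {
            if ( $text != '' ) {
                $html .= '<p>' . $text . '</p>';
            }
            continue;
        }

        if ( count( $stack ) == 2 * $depth ) {
            // Same depth as the previous line: end that item, start a new one.
            $html .= '</' . array_pop( $stack ) . '><li>';
            $stack[] = 'li';
        } else {
            // Deeper than before: open nested <ul><li> inside the current item.
            while ( count( $stack ) < 2 * $depth ) {
                $html .= '<ul><li>';
                $stack[] = 'ul';
                $stack[] = 'li';
            }
        }
        $html .= $text;
    }

    // End of input: close everything that is still open,
    // so the output is always well-formed.
    while ( count( $stack ) ) {
        $html .= '</' . array_pop( $stack ) . '>';
    }
    return $html;
}

Feeding it array( '* a', '** b', '* c', 'done' ) gives <ul><li>a<ul><li>b</li></ul></li><li>c</li></ul><p>done</p>, with the nested list inside the parent item and every element closed. The real thing obviously needs the full set of block types and the line-level parser on top.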
On Tue, Oct 14, 2003 at 11:16:19AM +1300, Richard Grevers wrote:
On Mon, 13 Oct 2003 17:21:15 -0400, David Friedland david@nohat.net gave utterance to the following:
There seems to be a lot of disjoint discussion on Meta about this. Viz:
- There is work that has been done by Taw on an OCAML lexer at http://meta.wikipedia.org/wiki/Wikipedia_lexer
My suggestions would be "the broken wikitext language" or "the invalid wikitext language". Because of its UseMod ancestry, the current parser produces some very bad HTML code*, and in particular handles lists and nesting of blocks really badly.
- not so bad if HTML 3.2 or 4 is our target, but it would be nice to be
able to produce clean XHTML. A few months back I started work on a ValidWiki parser, which has a much stronger concept of block and line elements, and uses both block and line stacks to open and close all elements correctly. I think I'm about 2/3 of the way through the block parser, and hadn't yet written the line parser. I have no idea how the code would compare for efficiency. Unfortunately, the only language I know how to code in is MivaScript, so it would need porting. (Miva performs okay for your mid-level merchant application, but doesn't have the efficiency for something with the workload of Wikipedia.)
Uhm, my parser has block stack + line stack architecture too. But the sources at http://meta.wikipedia.org/wiki/Wikipedia_lexer aren't the most recent.
Newer sources attached.
It's not complete, but it wasn't really meant to be. It was meant to be a proof of concept that a mix of wiki markup and HTML can be parsed in an XHTML-correct and DWIM way extremely efficiently. Concept proven, but integrating the parser with the rest of Wikipedia would take much more time than I'm willing to spend right now.
BTW, my Wikitext parser is also up at:
http://meta.wikipedia.org/wiki/Wikipedia_flexer
It's fairly old, and not quite complete, but otherwise quite good.