Hello, it looks like many people have done almost the same thing: I've also begun to write a parser. Mine is written in Python.
It's far from completion, but I don't want to spend my time writing useless code, so I'm asking how we want to continue: I don't think it's wise to have four parsers all doing the same thing. Of course, everybody can do what he or she wants, but I personally would rather work on one program than duplicate effort. As Magnus Manske already pointed out, such a parser could be the base of several desired tools. We could write a library that these applications could use.
The difference between my program and all the others, AFAIK, is how the wiki data actually gets loaded. As someone suggested on meta.wikipedia.org, I have written a file called 'raw.php' that retrieves the raw wiki data. In my opinion that's quite useful for offline-editing applications and similar tools.
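Just to illustrate the idea (the URL and the 'title' parameter here are only an example, not necessarily the actual interface of my raw.php), an offline tool could fetch the raw wikitext of an article roughly like this:

    import urllib

    def fetch_raw(title):
        # Hypothetical URL and parameter name -- adjust to the real raw.php interface.
        url = 'http://www.example.org/w/raw.php?title=%s' % urllib.quote(title)
        return urllib.urlopen(url).read()

    wikitext = fetch_raw('Main Page')

The point is simply that the client gets the unrendered wiki source back, which it can then parse or edit locally.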
You can get my parser, along with my adapted 'Article.php' and the file 'raw.php', here: www.fms-engel.de/buildHTML.tar.bz2 (Sorry for not providing a patch; I'll do that when I find time.)
I don't want to start a language flamewar, but I would probably prefer Magnus Manske's version. C++ is quite fast, and there are several GUI libraries one can use on each platform to write nice GUIs. I haven't looked at his program yet, but I suspect it's much more mature than mine, for example.
I would really like to discuss this topic, as I think a parser, and the possibilities that come with having one, is one of the most wanted features, at least for me :) There is surely a better solution than having four parsers that all do the same thing...
Regards, Frithjof
Frithjof Engel wrote:
[...] I don't think it's wise to have four parsers all doing the same thing. [...] As Magnus Manske already pointed out, such a parser could be the base of several desired tools. We could write a library that these applications could use. [...]
I would imagine that a formal grammar for Wikipedia markup (for example using EBNF) might be a good thing.
This could then be used for four purposes:
* to define the grammar clearly, for technical reference purposes
* to allow parsers to be generated with parser-generator compilers, which are available for a large number of languages, including C/C++, Python, Java etc.
* to help define a Document Object Model for the output of the parser
* to allow parsers to be validated as correct, by compiling a set of unit tests
However, what makes this difficult is that there should be no invalid documents in wiki markup: everything should produce some output, even if it's partly broken. For example, opening a link body but not closing it should end up as a literal "[[" in the text, not a parser error. Another way of putting it is that _all_ strings should be valid productions of the grammar; however, done naively, this can end up with an ambiguous grammar in which the same input can be parsed two different ways.
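To make that concrete, here is a toy sketch (not any of the existing parsers, purely an illustration) of how a renderer can treat only complete link constructs, roughly the production link = "[[", target, [ "|", label ], "]]", as links, so that a stray "[[" simply falls through as literal text instead of being an error:

    import re

    # Only complete [[target]] or [[target|label]] constructs are matched;
    # anything else, including an unclosed '[[', is left untouched.
    LINK_RE = re.compile(r'\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]')

    def render_links(text):
        out = []
        pos = 0
        for m in LINK_RE.finditer(text):
            out.append(text[pos:m.start()])   # literal text before the link
            target = m.group(1)
            label = m.group(2) or target
            out.append('<a href="/wiki/%s">%s</a>' % (target.replace(' ', '_'), label))
            pos = m.end()
        out.append(text[pos:])                # trailing text, including any unmatched '[['
        return ''.join(out)

    print(render_links('See [[Main Page|the main page]] and a stray [[ bracket.'))

A formal grammar would have to encode the same fallback behaviour, which is exactly where the ambiguity creeps in.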
Has anyone made an attempt at defining a formal grammar for Wikipedia?
-- Neil
On Mon, Oct 13, 2003 at 09:05:14PM +0100, Neil Harris wrote:
[...]
Has anyone made an attempt at defining a formal grammar for Wikipedia?
Plus, make sure it doesn't make these mistakes: http://en.wikipedia.org/wiki/User:Marumari/Wikitext_Rendering_Quirks
On Monday, Oct 13, 2003, at 13:05 US/Pacific, Neil Harris wrote:
Has anyone made an attempt at defining a formal grammar for Wikipedia?
There's been various stuff on meta, such as: http://meta.wikipedia.org/wiki/Wikipedia_lexer
-- brion vibber (brion @ pobox.com)