Frithjof Engel wrote:
Hello,
looks like many people have done almost the same thing: I've also
begun to write a parser. Mine is written in Python.
It's far from complete, but I don't want to spend my time
writing useless code, so I am asking how we want to continue:
I don't think it's wise to have four parsers that all do the same thing.
Of course, everybody can do what he/she wants, but I personally
would rather work on one program than do duplicated, useless
work.
As Magnus Manske already pointed out, such a parser could be the base
of several desired tools. We could write a library that could be used
by these applications.
The difference between my program and all the others, AFAIK, is
how the wiki data actually gets loaded.
As someone suggested on
meta.wikipedia.org I have written
a file called 'raw.php' that retrieves the raw wiki data.
In my opinion that's quite useful for offline-editing applications
and such things.
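To make the idea concrete, here is a minimal Python sketch of an offline tool fetching raw wikitext from a raw.php-style endpoint. The base URL and the 'title' parameter name are my assumptions for illustration, not the actual interface of the attached raw.php:

```python
# Sketch of fetching raw wiki markup over HTTP.
# NOTE: the endpoint URL and the "title" parameter are assumed
# for illustration; check raw.php itself for the real interface.
from urllib.parse import urlencode
from urllib.request import urlopen


def build_raw_url(base, title):
    """Build the request URL for the raw wikitext of a page."""
    return base + "?" + urlencode({"title": title})


def fetch_raw(base, title):
    """Download and return the raw wiki markup of an article."""
    with urlopen(build_raw_url(base, title)) as resp:
        return resp.read().decode("utf-8")
```

An offline editor could call fetch_raw() once, let the user edit locally, and post the result back later.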
You can get my parser, along with my adapted 'Article.php' and the
file 'raw.php', here:
www.fms-engel.de/buildHTML.tar.bz2
(Sorry for not providing a patch, when I find time, I'll do it)
I don't want to start a language-flamewar, but I probably prefer
Magnus Manske's version. C++ is quite fast and there are several
GUI libraries one can use for each platform to write nice GUIs.
I did not take a look at his program, but I guess it's much more mature
than, for example, mine.
I would really like to discuss this topic, as I think a parser, and
the possibilities that come with having one, is one of the most
wanted features, at least for me :)
There is probably a better solution than having four parsers that
all do the same thing...
Regards,
Frithjof
I would imagine that a formal grammar for Wikipedia markup (for example
using EBNF) might be a good thing.
This could then be used for four purposes:
* to define the grammar clearly for technical reference purposes
* to allow the generation of parsers using parser-generator compilers
which are available for a large number of languages, including C/C++,
Python, Java etc.
* to help define a Document Object Model for the output of the parser
* to allow parsers to be validated as correct, by allowing the
compilation of a set of unit tests
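As a rough illustration (not a proposal; the nonterminal names are
mine, and real wiki markup is messier than this), a toy EBNF fragment
for internal links and bold text might look like:

```
document  = { block } ;
block     = heading | paragraph ;
heading   = "=" , { "=" } , text , { "=" } , newline ;
paragraph = { inline } , blank-line ;
inline    = link | bold | plain-text ;
link      = "[[" , page-name , [ "|" , link-text ] , "]]" ;
bold      = "'''" , { inline } , "'''" ;
```

A parser generator could turn such rules directly into parsers in
each of the languages mentioned above.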
However, what makes this difficult is that there should be no invalid
documents in wiki markup: everything should produce some output, even
if it's partly broken. For example, opening a link body but not
closing it should end up as a literal "[[" in the text, not a parser
error.
Another way of putting it is that _all_ strings should be valid
productions of the grammar; however, done naively, this can end up
with an ambiguous grammar where the same input can be parsed in two
ways.
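The "no invalid input" rule can be shown with a toy Python scanner
(not any of the parsers mentioned in this thread): an unclosed "[["
simply falls back to literal text instead of raising an error.

```python
# Toy inline scanner: demonstrates graceful fallback for
# unclosed "[[" link markup, per the rule that every input
# string must produce some output.

def parse_inline(text):
    """Split text into ('text', s) and ('link', target) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        start = text.find("[[", pos)
        if start == -1:
            # No more links: the rest is plain text.
            tokens.append(("text", text[pos:]))
            break
        end = text.find("]]", start + 2)
        if end == -1:
            # Unclosed link: keep the "[[" as literal text.
            tokens.append(("text", text[pos:]))
            break
        if start > pos:
            tokens.append(("text", text[pos:start]))
        tokens.append(("link", text[start + 2:end]))
        pos = end + 2
    return tokens
```

A unit-test suite built from pairs like these is exactly what the
validation point above would need.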
Has anyone made an attempt at defining a formal grammar for Wikipedia?
-- Neil