Norbert Kurz wrote:
Hello,
my name is Norbert Kurz and I am a student of applied computer science
in Germany.
I downloaded the 7.8GB XML dump of the German Wikipedia and split it
into article files.
Now I want to convert the text in the text tag (<text>) into an HTML page.
My problem is that there is a special syntax for tables, lists, links etc.
My question is:
Is there a definition of the XML syntax, so that it is easy to write an
XML-to-HTML script?
Is there a file that describes all of these special cases, the LaTeX
stuff written in the XML files ( \longrightarrow ), and the tables?
Now I want to thank you all for your great work, I am happy that you
make the effort to export the whole Wikipedia, so other people
can download it and play around. Please keep up your good work.
Thanks in advance for your help.
Best regards
Norbert Kurz, Stuttgart, Germany
The text inside <text> uses the same syntax as Wikipedia itself (MediaWiki
wikitext). No, there is no formal, machine-readable grammar for it (such as
an ANTLR grammar). It is effectively defined as "whatever the parser
outputs". There are help pages describing the markup, though.
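To give an idea of what hand-written conversion involves, here is a minimal sketch in Python that handles only three constructs: bold, italics and internal links. The function name tiny_wikitext_to_html and the regular expressions are made up for illustration and are not how the MediaWiki parser works. Templates, tables, references and <math> are not handled at all, which is exactly why the real parser is the better route.

```python
import html
import re


def tiny_wikitext_to_html(text):
    """Convert a very small subset of wikitext to HTML (illustrative only)."""
    # Escape &, < and > first; quote=False leaves ' alone, since the
    # bold/italic markers below are made of apostrophes.
    out = html.escape(text, quote=False)

    def link(match):
        target = match.group(1)
        label = match.group(2) or target  # [[Target]] uses the target as label
        # Simplification: spaces become underscores, no URL encoding is done.
        return '<a href="{}">{}</a>'.format(target.replace(" ", "_"), label)

    # [[Target]] and [[Target|label]]
    out = re.sub(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]", link, out)
    # '''bold''' must be handled before ''italic''
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)
    return out
```

Even this small subset shows the problem: the order of the substitutions matters, and every further construct interacts with the ones already handled.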
What you should do is download MediaWiki and use its
maintenance/renderDump.php script to render the whole dump. Expect that
to take a very long time.
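If you only need the raw wikitext of each page, for example to hand pages to a renderer one at a time, the dump can be streamed instead of loaded whole. The following is a minimal sketch in Python, assuming the usual page / title / revision / text nesting of the export format. The names iter_pages, local_name and SAMPLE are made up for illustration, and the namespace URI in SAMPLE is only an example, since it depends on the export version.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a real dump; the xmlns value varies by export version.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title><revision><text>hello [[B]]</text></revision></page>"
    b"<page><title>B</title><revision><text>x</text></revision></page>"
    b"</mediawiki>"
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a dump file path or file object."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""  # empty <text/> gives None
        elif name == "page":
            yield title, text
            title, text = None, None
            # Drop the page's children to keep memory use low. The emptied
            # <page> elements still stay attached to the root in this sketch.
            elem.clear()


if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(wikitext))
```

For a real dump, pass the file path (or an opened, decompressed file object) instead of the in-memory sample.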
Note that if you want a "local Wikipedia copy", there are also other
options available that do not require prerendering everything.