Norbert Kurz wrote:
Hello,
my name is Norbert Kurz and I am a student of applied computer science in Germany.
I downloaded the 7.8 GB XML dump of the German Wikipedia and split it into article files.
Now I want to convert the text inside the <text> tag into an HTML page. My problem is that there is a special syntax for tables, lists, links, etc.
My question is: Is there a definition of the syntax used in the XML files, so that it is easy to write an XML-to-HTML script?
Is there a file that describes all of these special cases, the LaTeX markup written in the XML files (e.g. \longrightarrow), and the tables?
I also want to thank you all for your great work. I am happy that you make the effort to export the whole of Wikipedia so that other people can download it and play around with it. Please keep up the good work.
Thanks in advance for your help.
Best regards
Norbert Kurz, Stuttgart Germany
The syntax is the same one Wikipedia itself uses (MediaWiki wikitext). No, there is no formal, machine-readable grammar for it (for example, an ANTLR grammar). It is defined as "whatever the parser outputs". There are help pages that describe the markup, though.
What you should do is download MediaWiki and use the maintenance/renderDump.php script to render the whole dump. That will take a very long time. Note that if you want a "local Wikipedia copy", there are also other options that do not require prerendering everything.
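
If you only need the raw wikitext of each article before handing it to a renderer, something like the following Python sketch might be a starting point. It streams through the XML and yields the title and the content of the <text> element for each page. The function names (iter_pages, local_name) and the embedded sample document are made up for illustration, and the namespace URI in the sample may differ from the one in your dump, which is why the code ignores namespaces altogether. It does not convert wikitext to HTML; that part still needs the MediaWiki parser.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop the "{namespace}" prefix that ElementTree puts in front of tag names.
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a file name or a binary file object. The document is
    streamed, so the whole dump never has to fit in memory.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the children of the finished page


# Tiny hand-written sample (an assumption, not taken from a real dump).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Beispiel</title>
    <revision>
      <text>'''Beispiel''' ist ein [[Link]].</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, "->", wikitext)
```

For a real dump you would pass the path of the XML file instead of the in-memory sample, for example iter_pages("dewiki-pages-articles.xml") (the file name here is only a placeholder).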