Hello,
my name is Norbert Kurz and I am a student of applied computer science in Germany.
I downloaded the 7.8 GB XML dump of the German Wikipedia and split it into one file per article.
Now I want to convert the text inside the text tag (<text>) into an HTML page. My problem is that this text uses a special syntax for tables, lists, links, etc.
My question is: is there a definition of this syntax, so that it is easy to write an XML-to-HTML script?
For example:
Zu den Regisseuren, die das Pseudonym benutzt haben, gehören:
* [[Don Siegel]] und [[Robert Totten]] (für [[Frank Patch – Deine Stunden sind gezählt]]),
* [[David Lynch]] (für die dreistündige Fernsehfassung von [[Der Wüstenplanet (Film)|Der Wüstenplanet]]),
* [[Chris Christensen]] (The Omega Imperative),
* [[Stuart Rosenberg]] (für [[Let’s Get Harry]]),
* [[Richard C. Sarafian]] (für [[Starfire]]),
* [[Dennis Hopper]] (für [[Catchfire]]),
* [[Arthur Hiller]] (für [[An Alan Smithee Film: Burn Hollywood Burn]]),
* [[Rick Rosenthal]] (Birds II) und
* [[Kevin Yagher]] ([[Hellraiser IV – Bloodline]]).
* Der Pilotfilm der Serie [[MacGyver]] führt einen Alan Smithee als Regisseur <ref>http://www.imdb.com/title/tt0165375/ </ref>
The asterisk means that the line is a list item, the double brackets [[ ]] mean that there is a link, and the pipe separates the link target from the displayed text: [[ LINKNAME | SHOWN_NAME ]]
Is there a document that describes all of these special cases, as well as the LaTeX markup embedded in the XML files (e.g. \longrightarrow) and the tables?
Finally, I want to thank you all for your great work. I am happy that you make the effort to export the whole of Wikipedia so that other people can download it and play around with it. Please keep up the good work.
Thanks in advance for your help.
Best regards
Norbert Kurz, Stuttgart, Germany
The syntax is the same as on Wikipedia itself: the content of the text tag is MediaWiki wikitext. No, there is no definition of the syntax in machine-readable form (such as an ANTLR grammar). It is defined as "whatever the parser outputs". There are help pages, though.
What you should do is download MediaWiki and use the maintenance/renderDump.php script to render the whole dump. That will take a very long time. Note that if you just want a "local Wikipedia copy", there are other options available that do not require prerendering everything.
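If you only need a rough preview of a few articles rather than a faithful rendering, the two constructs mentioned in the question (bullet lists and [[LINKNAME|SHOWN_NAME]] links) can be approximated with a few lines of Python. This is only a sketch under stated assumptions, not a replacement for the MediaWiki parser: the names convert_links and wikitext_to_html and the "/wiki/" base URL are made up for illustration, and templates, tables, <ref> tags, nested lists and math markup are not handled at all.

```python
import html
import re
from urllib.parse import quote

# Matches [[Target]] or [[Target|Label]]; the label part is optional.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def convert_links(line, base_url="/wiki/"):
    """Turn wiki links into HTML anchors and escape all other text.

    base_url is a hypothetical prefix; adjust it to your own site layout.
    """
    out = []
    pos = 0
    for m in LINK_RE.finditer(line):
        # Escape the plain text between the previous link and this one.
        out.append(html.escape(line[pos:m.start()]))
        target = m.group(1).strip()
        # Fall back to the target when no (or an empty) label is given.
        label = (m.group(2) or target).strip()
        href = base_url + quote(target.replace(" ", "_"))
        out.append('<a href="%s">%s</a>' % (html.escape(href), html.escape(label)))
        pos = m.end()
    out.append(html.escape(line[pos:]))
    return "".join(out)


def wikitext_to_html(text):
    """Convert lines starting with '*' to <ul> items, other lines to <p>."""
    out = []
    in_list = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("*"):
            if not in_list:
                out.append("<ul>")
                in_list = True
            # Nested lists (**, ***) are flattened to one level here.
            out.append("<li>%s</li>" % convert_links(line.lstrip("*").strip()))
            continue
        if in_list:
            out.append("</ul>")
            in_list = False
        if line:
            out.append("<p>%s</p>" % convert_links(line))
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "Directors:\n* [[Don Siegel]]\n* [[MacGyver|the series]]"
    print(wikitext_to_html(sample))
```

Anything beyond such simple cases quickly runs into the "whatever the parser outputs" problem described above, which is why rendering through MediaWiki itself remains the reliable route.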
xmldatadumps-l@lists.wikimedia.org