[Xmldatadumps-l] wikipedia XML dump, the text tag

Norbert Kurz deinbert at googlemail.com
Tue Aug 10 12:21:36 UTC 2010


Hello,

My name is Norbert Kurz and I am a student of applied computer science in
Germany.

I downloaded the 7.8GB XML dump of the German Wikipedia and split it into
article files.
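
A minimal sketch of such a splitting step, assuming the dump follows the
usual MediaWiki export layout (page > title, page > revision > text). The
helper names local and iter_articles are hypothetical, and the namespace
URI in the sample is only illustrative; the code ignores whatever
namespace is present.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    # ElementTree reports tags as "{namespace}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]

def iter_articles(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's children to save memory

# Tiny stand-in for the real dump (assumed structure, not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Alan Smithee</title>
    <revision>
      <text>* [[Don Siegel]] &lt;ref&gt;x&lt;/ref&gt;</text>
    </revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

Because the XML parser decodes entities, an escaped &lt;ref&gt; in the dump
arrives as a literal <ref> in the extracted text. For the real dump, the
same function could be given an open file object instead of the sample.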

Now I want to convert the content of the text tag (<text>) into an HTML page.
My problem is that it uses a special syntax for tables, lists, links, etc.

My question is:
Is there a definition of this syntax, so that it is easy to write an
XML-to-HTML script?

E.g.
Zu den Regisseuren, die das Pseudonym benutzt haben, gehören:
* [[Don Siegel]] und [[Robert Totten]] (für [[Frank Patch – Deine Stunden
sind gezählt]]),
* [[David Lynch]] (für die dreistündige Fernsehfassung von [[Der
Wüstenplanet (Film)|Der Wüstenplanet]]),
* [[Chris Christensen]] (The Omega Imperative),
* [[Stuart Rosenberg]] (für [[Let’s Get Harry]]),
* [[Richard C. Sarafian]] (für [[Starfire]]),
* [[Dennis Hopper]] (für [[Catchfire]]),
* [[Arthur Hiller]] (für [[An Alan Smithee Film: Burn Hollywood Burn]]),
* [[Rick Rosenthal]] (Birds II) und
* [[Kevin Yagher]] ([[Hellraiser IV – Bloodline]]).
* Der Pilotfilm der Serie [[MacGyver]] führt einen Alan Smithee als
Regisseur &lt;ref&gt;http://www.imdb.com/title/tt0165375/ &lt;/ref&gt;

The asterisk means that the line is a list item,
two opening brackets [[ mean that there is a link,
and the pipe separates the link target from the displayed text:
[[ LINKNAME | SHOWN_NAME ]]
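
A minimal sketch of a converter for just these two constructs, under
stated assumptions: each list item sits on a single line, links are not
nested, and link targets map to a "/wiki/" URL prefix (an assumed scheme).
The names render_links and wikitext_to_html are hypothetical. Tables,
templates, nested lists, <ref> tags and math markup are not handled.

```python
import html
import re
from urllib.parse import quote

# [[target]] or [[target|label]], without nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def render_links(line):
    """Escape plain text and turn [[target|label]] into <a> elements."""
    out, pos = [], 0
    for m in LINK.finditer(line):
        out.append(html.escape(line[pos:m.start()]))
        target = m.group(1).strip()
        label = (m.group(2) or target).strip()
        # Assumed URL scheme: spaces become underscores, rest is percent-encoded.
        href = "/wiki/" + quote(target.replace(" ", "_"))
        out.append('<a href="%s">%s</a>' % (href, html.escape(label)))
        pos = m.end()
    out.append(html.escape(line[pos:]))
    return "".join(out)

def wikitext_to_html(text):
    """Convert '*' list lines and [[links]] to HTML; other lines become <p>."""
    out, in_list = [], False
    for line in text.splitlines():
        if line.startswith("*"):
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append("<li>%s</li>" % render_links(line[1:].strip()))
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            if line.strip():
                out.append("<p>%s</p>" % render_links(line.strip()))
    if in_list:
        out.append("</ul>")
    return "\n".join(out)

page = wikitext_to_html("Intro:\n* [[Don Siegel]]\n* [[Starfire|Star]] x")
```

Anything not recognised is escaped and passed through as text, so
unsupported markup stays visible in the output instead of being silently
dropped.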

Is there a file that describes all of these special cases, as well as the
LaTeX markup written in the XML files ( \longrightarrow ) and the tables?

I also want to thank you all for your great work. I am happy that you make
the effort to export the whole of Wikipedia so that other people
can download it and play around with it. Please keep up the good work.

Thanks in advance for your help.

Best regards

Norbert Kurz, Stuttgart, Germany