This works too, but it's slower than molasses on a cold Utah day ....
:-)
Jeff
Brion Vibber wrote:
Harish TM wrote:
I was trying to parse the Wikipedia dumps, but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
/mediawiki/page/title
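For what it's worth, a streaming parser is all you need for that. A rough Python sketch (the filename is a placeholder, and it assumes the usual <mediawiki><page><title>/<revision><text> layout of the export schema):

import xml.etree.ElementTree as ET

def localname(tag):
    # Drop the export-schema namespace, e.g.
    # "{http://www.mediawiki.org/xml/export-0.3/}title" -> "title"
    return tag.rsplit('}', 1)[-1]

def iter_pages(path):
    # Process the dump one <page> element at a time instead of
    # loading the whole file.
    for event, elem in ET.iterparse(path, events=('end',)):
        if localname(elem.tag) == 'page':
            title, text = None, None
            for child in elem.iter():
                if localname(child.tag) == 'title':
                    title = child.text
                elif localname(child.tag) == 'text':
                    text = child.text
            yield title, text
            elem.clear()  # discard the finished <page> subtree

for title, text in iter_pages('pages-articles.xml'):
    print(title)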
2. The article content (without links to articles in other languages, external links, and so on)
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
3. The category.
Again, that's part of the article text.
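The categories sit in the wikitext as [[Category:Foo]] links, so once you have the revision text a simple pattern match pulls them out. Rough sketch only; it ignores localized namespace names and categories added by templates:

import re

CATEGORY_RE = re.compile(r'\[\[Category:([^\]|]+)', re.IGNORECASE)

def categories(wikitext):
    # Category names linked from the page text; any "|sort key"
    # part is already excluded by the pattern.
    return [name.strip() for name in CATEGORY_RE.findall(wikitext or '')]

print(categories("[[Category:Sweeteners|Molasses]] and [[Category:Syrup]]"))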
Also, I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
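If you just want the rendered HTML and don't want to set up a local MediaWiki install, one option is to let the live site's parser do it through api.php. This is only a sketch; the page title is an example, and for whole-dump work you are better off importing the dump and parsing locally rather than scraping page by page:

import json
import urllib.parse
import urllib.request

def rendered_html(title):
    # Ask the site's own parser for one page's HTML via the
    # action=parse API module.
    params = urllib.parse.urlencode({'action': 'parse', 'page': title,
                                     'format': 'json'})
    req = urllib.request.Request(
        'https://en.wikipedia.org/w/api.php?' + params,
        headers={'User-Agent': 'dump-experiments/0.1 (example)'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['parse']['text']['*']

print(rendered_html('Molasses')[:200])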
-- brion vibber (brion @ pobox.com)