This works too, but it's slower than molasses on a cold Utah day ....
:-)
Jeff
Brion Vibber wrote:
Harish TM wrote:
I was trying to parse the Wikipedia dumps, but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
/mediawiki/page/title
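For what it's worth, a streaming parser is all you need for that. A rough Python sketch (the filename is a placeholder, and it assumes the usual <mediawiki><page><title>/<revision><text> layout of the export schema):

import xml.etree.ElementTree as ET

def localname(tag):
    # Drop the export-schema namespace, e.g.
    # "{http://www.mediawiki.org/xml/export-0.3/}title" -> "title"
    return tag.rsplit('}', 1)[-1]

def iter_pages(path):
    # Process the dump one <page> element at a time instead of
    # loading the whole file.
    for event, elem in ET.iterparse(path, events=('end',)):
        if localname(elem.tag) == 'page':
            title, text = None, None
            for child in elem.iter():
                if localname(child.tag) == 'title':
                    title = child.text
                elif localname(child.tag) == 'text':
                    text = child.text
            yield title, text
            elem.clear()  # discard the finished <page> subtree

for title, text in iter_pages('pages-articles.xml'):
    print(title)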
2. The article content (without links to articles in other languages, external links, and so on)
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
3. The category.
Again, that's part of the article text.
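The categories sit in the wikitext as [[Category:Foo]] links, so once you have the revision text a simple pattern match pulls them out. Rough sketch only; it ignores localized namespace names and categories added by templates:

import re

CATEGORY_RE = re.compile(r'\[\[Category:([^\]|]+)', re.IGNORECASE)

def categories(wikitext):
    # Category names linked from the page text; any "|sort key"
    # part is already excluded by the pattern.
    return [name.strip() for name in CATEGORY_RE.findall(wikitext or '')]

print(categories("[[Category:Sweeteners|Molasses]] and [[Category:Syrup]]"))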
Also, I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
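If you just want the rendered HTML and don't want to set up a local MediaWiki install, one option is to let the live site's parser do it through api.php. This is only a sketch; the page title is an example, and for whole-dump work you are better off importing the dump and parsing locally rather than scraping page by page:

import json
import urllib.parse
import urllib.request

def rendered_html(title):
    # Ask the site's own parser for one page's HTML via the
    # action=parse API module.
    params = urllib.parse.urlencode({'action': 'parse', 'page': title,
                                     'format': 'json'})
    req = urllib.request.Request(
        'https://en.wikipedia.org/w/api.php?' + params,
        headers={'User-Agent': 'dump-experiments/0.1 (example)'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['parse']['text']['*']

print(rendered_html('Molasses')[:200])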
-- brion vibber (brion @ pobox.com)