[WikiEN-l] Using english-Wikipedia XML dump
akhil1988
akhilanger at gmail.com
Sat Jun 27 21:24:13 UTC 2009
Hi all!
I'm new to this list.
I am looking for references to help me work with the English Wikipedia XML
dump.
Here's what I need to do with it:
I will set up a server on which people can browse Wikipedia articles and
also a processed version of each article. By "processed version" I mean the
Wikipedia article with some additional information attached to each line.
For example:
A line in the Wikipedia article on Chicago
(http://en.wikipedia.org/wiki/Chicago) reads:
Chicago (pronounced /ʃɨˈkɑːɡoʊ/ or /ʃɨˈkɔːɡoʊ/) is the largest city in the
U.S. state of Illinois, and with over 2.8 million people is the third
largest city in the country.
My processed version of the Wikipedia page would look like this:
Chicago (pronounced /ʃɨˈkɑːɡoʊ/ or /ʃɨˈkɔːɡoʊ/) is the largest city in the
U.S. state of Illinois, and with over 2.8 million people is the third
largest city in the country. <Some additional information about this line>
Don't worry about "Some additional information about this line". It comes
from some NLP (natural language processing) code that processes the line and
generates additional information about it.
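
A minimal sketch of this per-line annotation, assuming Python. The `annotate`
function is a hypothetical placeholder standing in for the NLP step (here it
only counts words), and `process_text` is likewise an illustrative name:

```python
def annotate(line):
    """Hypothetical stand-in for the NLP step: returns a word-count tag."""
    return "<words=%d>" % len(line.split())


def process_text(text):
    """Append the annotation to every non-empty line, keeping line order."""
    out = []
    for line in text.splitlines():
        if line.strip():
            # Non-empty line: keep the original text, then add the extra info.
            out.append(line.rstrip() + " " + annotate(line))
        else:
            # Blank lines pass through unchanged so the layout is preserved.
            out.append(line)
    return "\n".join(out)


print(process_text("Chicago is the largest city in Illinois.\n\nIt is on Lake Michigan."))
```

Blank lines are left untouched so that paragraph breaks in the original text
survive the processing.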
So, if somebody wants to access the processed version of any Wikipedia
article, they can go to: http://myserver/wiki/processed_Chicago
I hope it is clear what I intend to do with the Wikipedia XML dump.
For this I need to know the following:
1. How should I extract articles from the XML dump, extract the plain text
from each one, process it, and then insert the processed text back line by
line at the same place in the article's markup, together with the additional
information generated by the NLP step? Throughout this process, I want the
processed page to keep the same look as the original Wikipedia page.
2. How do I render a Wikipedia page from the XML dump so that it looks just
like the online version of Wikipedia?
3. The XML dump does not contain images, so how will I render images when a
page on my server is accessed?
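
For the extraction part of point 1, one possible starting point is to stream
the dump with Python's standard `xml.etree.ElementTree.iterparse` rather than
loading it whole. The sketch below assumes the usual MediaWiki export layout
(`<page>` elements containing `<title>` and `<revision><text>`); `iter_pages`
and `local` are illustrative names, and the inline sample stands in for a real
dump file. It does not address rendering or images.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki export stream.

    `source` may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Drop the page's children so memory use stays small on large dumps.
        elem.clear()


# Tiny inline sample in place of a real dump (assumed layout).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    b"<page><title>Chicago</title>"
    b"<revision><text>Chicago is a city in Illinois.</text></revision>"
    b"</page></mediawiki>"
)

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

For a compressed dump, the same function could be given the file object
returned by `bz2.open(path, "rb")`, where `path` is wherever the downloaded
dump was saved. The text yielded is raw wiki markup, so extracting plain text
from it is a separate step.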
Any references or ideas in this regard will be greatly appreciated.
Thanks,
Akhil