Hi All!
I'm new to this forum.
I am looking for some references to help me work with the Wikipedia XML dump.
Here's what I want to do with the XML dump:
I will set up a server on which people can browse Wikipedia articles and also a processed version of each article. By "processed version" I mean the Wikipedia article with some additional information attached to each line. For example,
a line in a Wikipedia article (http://en.wikipedia.org/wiki/Chicago) reads:
Chicago (pronounced /ʃɨˈkɑːɡoʊ/ or /ʃɨˈkɔːɡoʊ/) is the largest city in the U.S. state of Illinois, and with over 2.8 million people is the third largest city in the country.
My processed version of the Wikipedia page would look like this:
Chicago (pronounced /ʃɨˈkɑːɡoʊ/ or /ʃɨˈkɔːɡoʊ/) is the largest city in the U.S. state of Illinois, and with over 2.8 million people is the third largest city in the country. <Some additional information about this line>
Don't worry about "Some additional information about this line". It comes from some NLP (natural language processing) code that processes the line and generates additional information about it.
So anyone who wants to access the processed version of a Wikipedia article can go to: http://myserver/wiki/processed_Chicago
I hope it is clear what I intend to do with the Wikipedia XML dump.
For this I need to know the following things:
1. How should I extract articles from the XML dump, pull the plain text out of them, process it, and then insert the processed text back line by line at the same place in the article, together with the additional information generated by the NLP step? Throughout this process I want the page to keep the same look as the original Wikipedia page.
2. How do I render a Wikipedia page from the XML dump so that it looks like the online version of Wikipedia?
3. The XML dump does not contain images, so how do I render images when a page on my server is accessed?
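To make question 1 more concrete, here is a minimal Python sketch of the extraction and annotation step as I picture it. It assumes the dump follows the usual MediaWiki export layout (page elements containing a title and a revision with a text element), and it ignores the XML namespace because the schema version differs between dumps. The names iter_pages, annotate and process_wikitext are made up for illustration, annotate is only a placeholder for the real NLP code, and the namespace URI in the sample is just an example. Note that the text in the dump is wikitext markup, not plain text, so this only covers the first half of the question.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    `source` is a filename or a binary file object. Pages are read one at
    a time with iterparse, so the whole dump is never parsed in one go.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's children to save memory


def annotate(line):
    """Hypothetical stand-in for the NLP step: reports a word count."""
    return "<%d words>" % len(line.split())


def process_wikitext(text, annotate_line=annotate):
    """Append the annotation to every non-empty line, keeping line order."""
    out = []
    for line in text.split("\n"):
        if line.strip():
            out.append(line + " " + annotate_line(line))
        else:
            out.append(line)  # keep blank lines so paragraphs stay intact
    return "\n".join(out)


# Tiny made-up sample in the shape of a dump, for trying the functions out.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Chicago</title>
    <revision>
      <text>Chicago is the largest city in Illinois.

It is the third largest city in the country.</text>
    </revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
    print("processed_" + title)
    print(process_wikitext(text))
```

For the real dump, the same iter_pages call should accept a file object opened with bz2.open on the compressed file instead of the in-memory sample. The sketch does not address rendering the result as HTML or fetching images (questions 2 and 3).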
Any references or ideas in this regard will be greatly appreciated.
Thanks, Akhil
You might want to ask in the technical forum. Hopefully someone can point you that way, or answer your question here.
Carcharoth
On Sat, Jun 27, 2009 at 10:24 PM, akhil1988 <akhilanger@gmail.com> wrote:
WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l